Hard2Verify: Step-wise Proof Verification

Updated 2 June 2026
  • Hard2Verify is a comprehensive benchmark for verifying each step of advanced mathematical proofs, emphasizing correctness and precise error localization.
  • It employs human annotation and rigorous evaluation protocols to label individual solution steps as correct or incorrect in Olympiad-level and research-style problems.
  • The benchmark drives advances in AI alignment by enhancing verifier calibration through methodologies like latent steering and pseudo-formalization.

Hard2Verify is a human-annotated, step-level mathematical proof verification benchmark designed to rigorously evaluate how well automated verifiers, particularly LLMs, judge the correctness of each step and localize errors within complex, open-ended mathematical arguments. It addresses critical gaps in the evaluation and training of proof-verifying systems on Olympiad-level and research-style mathematics, where each inference step must be scrutinized for logical and mathematical soundness in the absence of readily available ground-truth answers (Pandit et al., 15 Oct 2025). Hard2Verify has since been used both as a benchmark and as a driver of work on step-wise mathematical reasoning, proof-checking, and AI alignment for mathematical agents.

1. Motivation and Objectives

The central motivation for Hard2Verify is that recent LLMs have matched or surpassed human performance on standard math benchmarks (GSM8K, MATH, AIME), yet evaluation on those benchmarks focuses almost exclusively on final-answer correctness. In frontier mathematical domains such as multi-step Olympiad or research proofs, correctness is inherently open-ended: there is no concise ground truth to compare against, and every inference step must be locally validated. Existing benchmarks (MR-GSM8K, ProcessBench, PRMBench) either emphasize simplistic or synthetic problems, or inject errors that do not reflect the natural missteps of state-of-the-art models. Before Hard2Verify, there was no large-scale dataset of challenging, human-verified step-wise outputs produced by actual frontier LLMs.

Hard2Verify explicitly targets this regime by collecting difficult contemporary competition problems and actual multi-step solutions produced by models such as GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4. Expert human annotators label every solution step as “correct” or “incorrect,” providing a gold standard for step-level verification. This enables the systematic training and evaluation of verifiers with direct relevance to real-world, high-stakes mathematical reasoning scenarios (Pandit et al., 15 Oct 2025, Zhou et al., 20 May 2026).

2. Dataset Construction

Hard2Verify’s dataset construction spans four tightly controlled stages:

  1. Question Curation: Eighty high-difficulty problems are drawn from ten leading mathematics competitions (IMO 2023–2025, Putnam, EGMO, INMO, USAMO, BMO, CMO, USA JMO), using LaTeX-extracted (MathPix) statements and excluding image-based questions.
  2. Response Generation: Each problem is presented to three top-tier LLMs (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4; fixed prompt, identical decoding, no external tools), producing 200 high-quality model-generated solutions after filtering degenerate/terse outputs.
  3. Annotation Workflow: 52 annotators (35 with advanced math degrees, affiliated with Turing research accelerator) split each solution into atomic steps. Four annotation rounds—initial labeling plus three reviews—culminate in expert adjudication. Annotators reference official solutions, computational tools, and standardized guidelines.
  4. Quality Control: The annotation effort exceeds 500 hours, grading 1,860 steps (spanning 200 responses). Of these, 58% are labeled “correct” (1,080 steps) and 42% “incorrect” (780 steps), ensuring broad coverage of both successful and erroneous reasoning.

The resulting dataset offers per-step and per-response labels, first-error localization, and supports both granular and holistic evaluation across varied verifier architectures (Pandit et al., 15 Oct 2025).

3. Formal Task Definition and Evaluation Protocol

The step-level verification framework formalizes the task as follows. Given a problem statement $Q$ and ordered solution steps $S = (s_0, s_1, \ldots, s_{n-1})$, a verifier’s goal is to produce:

  • Step-level correctness labels: $f_{\mathrm{step}}(Q, S) \mapsto (y_0, y_1, \ldots, y_{n-1})$, with $y_i \in \{\mathrm{yes}, \mathrm{no}\}$
  • First error index: $f_{\mathrm{error}}(Q, S) \mapsto k$, with $k \in \{-1, 0, \ldots, n-1\}$ ($k = -1$ signals no error)
  • Global response verdict: correct iff all step labels are “yes”
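
The three outputs above are tightly linked. As a minimal sketch, assuming a verifier derives the first error index and the global verdict directly from its own step labels (the function name `summarize_labels` is hypothetical and not part of the benchmark), the relationship can be written as:

```python
from typing import List, Tuple


def summarize_labels(step_labels: List[str]) -> Tuple[int, bool]:
    """Derive (first_error_index, response_is_correct) from step labels.

    Assumption: labels are the strings "yes" / "no", steps are zero-indexed,
    and -1 means that no step was flagged as incorrect.
    """
    for label in step_labels:
        if label not in ("yes", "no"):
            raise ValueError(f"unexpected label: {label!r}")

    # Index of the first "no", or -1 if every step is labeled "yes".
    first_error = next(
        (i for i, label in enumerate(step_labels) if label == "no"), -1
    )

    # The response is correct iff all step labels are "yes".
    return first_error, first_error == -1


print(summarize_labels(["yes", "yes", "no", "yes"]))  # (2, False)
print(summarize_labels(["yes", "yes"]))               # (-1, True)
```

A verifier that predicts the first error index separately from its step labels would not necessarily satisfy this relationship; the sketch only illustrates the definitions.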

The evaluation protocol covers three primary tasks:

  • Step-Level Correctness: Accurate labeling of each solution step.
  • Response-Level Correctness: Boolean verdict on the global correctness.
  • First Error Identification: Pinpointing the initial erroneous step.

Metrics are defined as follows, using TPR (true positive rate: the fraction of correct steps or responses marked correct) and TNR (true negative rate: the fraction of incorrect steps or responses marked incorrect):

$$\mathrm{BalancedAccuracy} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2}, \qquad \mathrm{BalancedF1} = \frac{2\,\mathrm{TPR}\,\mathrm{TNR}}{\mathrm{TPR} + \mathrm{TNR}}$$
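
To make the two metrics concrete, the sketch below computes them from gold and predicted labels. It assumes labels are booleans with `True` meaning "correct"; the function name `balanced_metrics` and the choice to return 0 for Balanced F1 when TPR + TNR = 0 are illustrative assumptions, not taken from the benchmark's code.

```python
from typing import Dict, Sequence


def balanced_metrics(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Compute TPR, TNR, balanced accuracy, and balanced F1.

    True means "correct step/response"; False means "incorrect".
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    n_pos = sum(1 for g in gold if g)
    n_neg = len(gold) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one correct and one incorrect item")

    # Correct items marked correct, and incorrect items marked incorrect.
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)

    tpr = tp / n_pos
    tnr = tn / n_neg
    bal_acc = (tpr + tnr) / 2
    # Harmonic mean of TPR and TNR; defined as 0 here when both are 0.
    bal_f1 = 0.0 if tpr + tnr == 0 else 2 * tpr * tnr / (tpr + tnr)
    return {"tpr": tpr, "tnr": tnr, "balanced_accuracy": bal_acc, "balanced_f1": bal_f1}


# A verifier that accepts every step: TPR = 1, TNR = 0.
gold = [True] * 4 + [False] * 4
always_yes = [True] * 8
print(balanced_metrics(gold, always_yes))
# {'tpr': 1.0, 'tnr': 0.0, 'balanced_accuracy': 0.5, 'balanced_f1': 0.0}
```

The always-accept example shows why the harmonic-mean form is informative: balanced accuracy still reports 0.5, while Balanced F1 drops to 0 for a verifier that never flags an error.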

Annotations are zero-indexed; generative verifiers follow fixed prompts, while process reward models (PRMs) yield continuous scores, thresholded for discrete decisions (Pandit et al., 15 Oct 2025, Zhou et al., 20 May 2026).
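
Thresholding a PRM's continuous scores can be sketched as follows. The threshold value of 0.5 and the convention that a score at or above the threshold counts as "yes" are assumptions made for illustration; the source does not state which threshold is used.

```python
from typing import List


def threshold_scores(scores: List[float], threshold: float = 0.5) -> List[str]:
    """Turn per-step PRM scores into discrete "yes"/"no" labels.

    Assumption: higher scores mean "more likely correct", and a score at or
    above the (hypothetical) threshold is treated as a correct step.
    """
    return ["yes" if score >= threshold else "no" for score in scores]


print(threshold_scores([0.9, 0.4, 0.5]))  # ['yes', 'no', 'yes']
```

Raising the threshold makes the resulting labels stricter, which trades true positive rate for true negative rate.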

4. Benchmarked Models and Experimental Results

Hard2Verify benchmarks 29 distinct verifiers:

  • Closed-source generative critics: GPT-5, GPT-5-Mini, o3, o4-Mini, GPT-4.1, Gemini 2.5 Pro, Claude Sonnet 4
  • Open-weight generative critics (large, ≥70B): Kimi K2, DeepSeek-R1, various Qwen and GLM models, gpt-oss-120B, Llama-3.3-70B-Instruct
  • Open-weight generative critics (<70B): Several Qwen and ByteDance/Seed-OSS models
  • Process reward models (PRMs): Qwen2.5-Math-PRM-72B, Qwen2.5-Math-PRM-7B, Skywork-PRM, ReasonFlux-PRM, UniversalPRM

Key findings include:

  • Closed-source critics outperform open-source: GPT-5 achieves ≈86.5% balanced accuracy on step-level, compared to ≈78% for leading open-weight models (e.g., gpt-oss-120B).
  • PRMs underperform: Open-source PRMs show poor error identification, with ErrorID task balanced accuracy well below random chance.
  • Error detection is the performance bottleneck: Most models achieve high TPR (finding correct steps), but are insensitive to incorrect steps (TNR ≈ 0 for weaker models).
  • Inference scaling: “Think-longer” (sequential detailed critique) increases step-level F1 (e.g., gpt-oss-120B from 61.5% to 74.6%), while “best-of-N” parallel resampling yields negligible improvement, indicating sequential depth is critical.
  • Self-verification dynamics: Strong models more effectively flag their own errors; all models reliably detect Claude Sonnet 4 errors and struggle most on Gemini 2.5 Pro (Pandit et al., 15 Oct 2025, Zhou et al., 20 May 2026).

A table summarizing several model categories and performance metrics appears below:

| Model category | Step-level balanced accuracy | Notes |
|---|---|---|
| GPT-5 (closed) | ≈86.5% | Highest-performing |
| gpt-oss-120B (open, large) | ≈78% | Leading open-weight |
| PRMs (open-source) | <50% (often much lower) | Poor error detection |

5. Methodological Advances and Extensions

Hard2Verify catalyzed development of advanced verifier control and modularization strategies:

  • Verifier Strictness and Latent Steering: Studies demonstrate that generative verifiers often suffer a positivity bias, frequently accepting flawed proofs. Uniform prompt-based strictness increases TNR (error detection) but reduces TPR (correct proof certification), producing a trade-off. VerifySteer, a latent-state steering method, adaptively controls strictness at both sample and token (paragraph boundary) level via learned activation vectors—without retraining. Applied to Qwen3-8B-thinking and FARE-20B verifiers, VerifySteer raises F₁ by up to 9.6 points over base models while requiring substantially less inference compute than self-consistency approaches (Zhou et al., 20 May 2026).
  • Pseudo-Formalization and Block Verification: Recognizing limitations of LLMs on informal natural-language proofs, the PF+BV framework structures proofs into self-contained modules, each with explicit premises, conclusion, and local argument. Verification proceeds module-by-module, with a final aggregator calibrating strictness for the overall verdict. On Hard2Verify, PF+BV achieves higher recall and ≈20% fewer false alarms than LLM-as-judge baselines, with increased proof-level coverage (Barkallah et al., 19 May 2026). The approach is underpinned by the property that block-wise verification requires only short-context accesses, avoiding the scaling challenges of monolithic proof analysis.
  • Proposed Extensions: End-to-end PF module emission, hybrid autoformalization (partial Lean/Isabelle compilation), learned adaptive module sizes, graph-neural global oversight, and entailment-based calibration are all proposed to further increase Hard2Verify applicability and robustness (Barkallah et al., 19 May 2026).
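
The module-by-module structure described for PF+BV can be sketched roughly as below. This is an illustration only: `ProofModule`, `verify_blocks`, and the `check_module` callable are hypothetical names, the checker stands in for an LLM call, and the simple rejection-count rule replaces the calibrating aggregator described in the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProofModule:
    """One self-contained block: explicit premises, a conclusion, and a local argument."""
    premises: List[str]
    conclusion: str
    argument: str


def verify_blocks(
    modules: List[ProofModule],
    check_module: Callable[[ProofModule], bool],
    max_rejected: int = 0,
) -> Dict[str, object]:
    """Check each module independently, then aggregate into one verdict.

    check_module is a placeholder for a short-context verifier call.
    max_rejected is a hypothetical strictness knob: the proof is accepted
    only if at most this many modules are rejected.
    """
    verdicts = [check_module(m) for m in modules]
    rejected = [i for i, ok in enumerate(verdicts) if not ok]
    return {
        "verdicts": verdicts,
        "rejected": rejected,
        "accepted": len(rejected) <= max_rejected,
    }


# Toy checker (assumption): a module passes if its argument is non-empty.
modules = [
    ProofModule(["n is even"], "n^2 is even", "Write n = 2k, so n^2 = 4k^2."),
    ProofModule(["n^2 is even"], "n^2 + 1 is odd", ""),
]
result = verify_blocks(modules, lambda m: bool(m.argument.strip()))
print(result["rejected"], result["accepted"])  # [1] False
```

Because each call sees only one module's premises, conclusion, and argument, the context passed to the checker stays short regardless of the overall proof length.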

6. Limitations, Failure Modes, and Cost Considerations

Known limitations of Hard2Verify and PF+BV methods include:

  • Translation fidelity: Pseudo-Formalization can fail for highly informal or terse arguments; translation hallucinations and omission errors require further mitigation mechanisms.
  • Domain specificity and annotation bias: Evaluations are presently limited to pure mathematics; not all errors in research-level proofs (as in ArxivMathGradingBench) may be annotated, potentially underestimating false positive rate.
  • Calibrator trust: Aggregation of module verdicts depends on the calibration LLM, potentially susceptible to prompt misconfiguration.
  • Cost: Modular methods (PF+BV at k = 8) cost roughly 8 to 9 times as much as single LLM-judge baselines on Hard2Verify (baseline ≈$30 vs PF+BV ≈$264 for 200 proofs, a ratio of about 8.8); latent steering methods such as VerifySteer are more cost-efficient than self-consistency voting (Barkallah et al., 19 May 2026, Zhou et al., 20 May 2026).
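
The cost figures quoted above can be put on a per-proof basis with simple arithmetic. The sketch below uses only the approximate totals from the text ($30, $264, 200 proofs); the function name `cost_summary` is hypothetical.

```python
from typing import Dict


def cost_summary(baseline_total: float, modular_total: float, n_proofs: int) -> Dict[str, float]:
    """Compare total and per-proof cost of two verification setups."""
    return {
        "ratio": modular_total / baseline_total,
        "baseline_per_proof": baseline_total / n_proofs,
        "modular_per_proof": modular_total / n_proofs,
    }


summary = cost_summary(baseline_total=30.0, modular_total=264.0, n_proofs=200)
print({k: round(v, 2) for k, v in summary.items()})
# {'ratio': 8.8, 'baseline_per_proof': 0.15, 'modular_per_proof': 1.32}
```

Since the totals are approximate, the per-proof values (about $0.15 versus about $1.32) should be read as rough estimates.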

Failure modes include over- or under-modularization (affecting interface consistency or context window fit), and hallucinations in proof structure translation.

7. Impact and Implications for AI Alignment and Mathematical Reasoning

Hard2Verify establishes the first rigorous, high-difficulty, human-verified step-wise benchmark for proof verification, addressing a central obstacle in deploying LLM reasoners for mathematical discovery or automated grading. It exposes persistent challenges in error localization and the gap between frontier generative models and reliable verification, and it reveals opportunities for improving the calibration and modularity of verifier architectures.

Notably, Hard2Verify supports important applications in:

  • Training RL from Verifiable Rewards (RLVR) on open-ended proofs,
  • Exploring hybrid and ensemble verifier strategies to trade off recall (error catching) and precision (avoiding false alarms),
  • Refining prompt and activation steering strategies for both open and closed models.

Robust step-level verification on benchmarks such as Hard2Verify is a foundational requirement for scalable oversight and alignment of LLMs with the standards of mathematical rigor demanded by open research and competition mathematics (Pandit et al., 15 Oct 2025, Zhou et al., 20 May 2026, Barkallah et al., 19 May 2026).
