Arxiv Math Grading Benchmark

Updated 2 June 2026

ArxivMathGradingBench is an open research benchmark that evaluates automated proof verification in advanced mathematics using genuine, author-corrected errors from arXiv papers.
The dataset comprises 35 research-level papers with 40 annotated errors, covering diverse subfields like number theory, analysis, and algebraic geometry.
Evaluation protocols compare methods such as PF+BV and LLM-as-judge using precision, recall, F1 scores, and false alarm counts for rigorous error detection.

ArxivMathGradingBench is an open, research-level mathematics benchmark designed for the fine-grained evaluation of proof verification—specifically, AI-based error detection in published mathematical papers. It enables rigorous, data-driven assessment and statistical comparison of automated verification systems operating at the frontier of mathematical research, where proofs exhibit high complexity, advanced techniques, and domain-specific notation. The benchmark addresses critical evaluation bottlenecks in scaling automated reasoning and proof verification by providing real-world test cases derived from the arXiv mathematics corpus, annotated with ground-truth error locations confirmed and corrected by paper authors (Barkallah et al., 19 May 2026).

1. Dataset Scope, Composition, and Sources

ArxivMathGradingBench comprises 35 research-level mathematics papers representing a cross-section of contemporary mathematical research. Papers are sourced exclusively from the primary "math.*" categories on arXiv (e.g., math.NT, math.CV, math.DS, math.AG) over the period January–July 2025. Selection is anchored to revision comments that explicitly mention corrections of flawed theorems, lemmas, or propositions, ensuring that all collected errors are genuine mathematical, rather than purely editorial or typographical, faults.
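
The selection step described above can be sketched as a keyword filter over revision comments. This is a minimal illustration, not the benchmark's actual filter: the two patterns, the function name `mentions_result_correction`, and the sample comments are all assumptions introduced here.

```python
import re

# Hypothetical patterns; the benchmark's real regex is not given in the source.
CORRECTION = re.compile(
    r"\b(correct(?:ed|ion|s)?|fix(?:ed|es)?|errors?|mistakes?|gaps?|flaw(?:ed|s)?)\b",
    re.IGNORECASE,
)
RESULT = re.compile(r"\b(theorem|lemma|proposition)s?\b", re.IGNORECASE)


def mentions_result_correction(comment: str) -> bool:
    """True if the comment contains both a correction word and a named result type."""
    return bool(CORRECTION.search(comment) and RESULT.search(comment))


# Invented sample comments, for illustration only.
comments = [
    "v2: corrected an error in the proof of Lemma 4.1",
    "v3: fixed typos; updated references",
    "v2: Theorem 1.5 was false as stated; statement corrected",
]
kept = [c for c in comments if mentions_result_correction(c)]
print(kept)
```

A filter of this kind would still admit a comment such as "fixed a typo in Lemma 2", so keyword matching alone cannot separate mathematical faults from editorial ones.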

The dataset contains 40 author-corrected errors—each linked to specific locations in the mathematical exposition (e.g., "Lemma 4.1," "Theorem 1.5") as reported by the original authors in arXiv revision logs. All included proofs are research-level, often spanning multiple pages and leveraging advanced techniques and domain-specific constructs. The distribution of topics is approximately as follows:

Area	≈ % of Papers
Number Theory (math.NT)	30 %
Complex/Real Analysis & PDE	25 %
Dynamical Systems (math.DS)	15 %
Algebraic/Differential Geometry	20 %
Other (topology, mathematical physics)	10 %

Proofs are uniformly at or above the graduate research level and frequently invoke nontrivial, literature-derived lemmas.

2. Collection, Annotation, and Data Format

Data curation follows a rigorous filtering and annotation pipeline:

Paper acquisition uses regex filtering on revision comments to isolate mathematics arXiv papers with explicit substantive error corrections referencing a theorem, lemma, or proposition. Non-mathematical corrections (typos, exposition) are filtered out by a lightweight GPT-5.4 nano classifier.
Error annotation leverages verbatim author-provided references to flawed results. Each benchmark entry includes: (1) the pre-correction raw LaTeX source, (2) corresponding compiled PDF, and (3) a list of error locations as rendered PDF labels (e.g., "Theorem 1.5").
Data model: The benchmark is published as a HuggingFace dataset in which each example is a JSON object holding the three components listed above.
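
One way to picture a single entry is as a small record that round-trips through JSON. The field names `latex_source`, `pdf_path`, and `error_locations` are hypothetical, chosen only to mirror the three components named in the annotation step; the published schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BenchmarkEntry:
    # All field names are assumptions made for illustration.
    latex_source: str                  # pre-correction raw LaTeX
    pdf_path: str                      # path to the compiled PDF
    error_locations: list[str] = field(default_factory=list)  # rendered PDF labels


# Placeholder values, not taken from the dataset.
entry = BenchmarkEntry(
    latex_source=r"\begin{theorem} ... \end{theorem}",
    pdf_path="paper_v1.pdf",
    error_locations=["Theorem 1.5"],
)

# Serialize to JSON and back, as a dataset row would be stored and loaded.
record = json.loads(json.dumps(asdict(entry)))
print(record["error_locations"])
```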

There are no "negative" locations: only known, author-identified faults are labeled, but other locations may also harbor latent errors.

3. Benchmark Structure, Labeling, and Evaluation Protocol

Each paper is structured as a set of potential proof units, with ground-truth error locations $Y = \{ y_1, \ldots, y_k \}$ identified from author revisions. Error prediction systems must return a predicted set $\widehat{Y}$, which is matched against $Y$ at the level of PDF-rendered strings so that evaluation stays aligned with how a human reader would cite the location.

The evaluation workflow specifies:

Input: LaTeX source and compiled PDF for a single paper.
Output: $\widehat{Y}$, a set of predicted erroneous locations.
Metrics (per paper and globally aggregated):
- True positives (TP): $|\widehat{Y} \cap Y|$
- False positives (FP): $|\widehat{Y} \setminus Y|$
- False negatives (FN): $|Y \setminus \widehat{Y}|$
- Precision, recall, and $F_1$ as usual:
$\mathrm{Precision} = \frac{TP}{TP+FP} \qquad \mathrm{Recall} = \frac{TP}{TP+FN} \qquad F_1 = \frac{2\cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
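
The set-based definitions above translate directly into code. The sketch below assumes labels have already been matched, so that plain set operations apply; returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the source, and the sample labels are invented.

```python
def score(predicted: set[str], truth: set[str]) -> dict[str, float]:
    """Compute TP/FP/FN and precision, recall, F1 for one paper."""
    tp = len(predicted & truth)   # |Y_hat ∩ Y|
    fp = len(predicted - truth)   # |Y_hat \ Y|
    fn = len(truth - predicted)   # |Y \ Y_hat|
    # Assumed convention: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Invented example: one of two labeled errors is found, with two extra predictions.
truth = {"Lemma 4.1", "Theorem 1.5"}
predicted = {"Lemma 4.1", "Proposition 2.3", "Lemma 3.2"}
result = score(predicted, truth)
print(result)
```

In this example $TP = 1$, $FP = 2$, and $FN = 1$, giving precision $1/3$, recall $1/2$, and $F_1 = 0.4$.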

Since ground-truth labels are incomplete, precision is a lower bound on true precision.
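
To see why, suppose $u$ of the $FP$ unmatched predictions are in fact genuine but unlabeled errors (the symbol $u$ is introduced here for illustration and is unobservable from the benchmark alone). Then

$\mathrm{Precision}_{\mathrm{true}} = \frac{TP + u}{TP + FP} \;\ge\; \frac{TP}{TP + FP} = \mathrm{Precision}, \qquad 0 \le u \le FP$

with equality only when none of the unmatched predictions corresponds to a real error.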

Block matching uses LLM-assisted string comparison to align predicted with ground-truth error locations at the level of rendered PDF labels.
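
As a rough picture of what the alignment has to accomplish, the sketch below uses deterministic label normalization as a stand-in. It is not the LLM-assisted matcher described above: the abbreviation rules and the function names `normalize_label` and `match_labels` are assumptions, and a purely rule-based matcher would miss variants that an LLM comparison could resolve.

```python
import re


def normalize_label(label: str) -> str:
    """Lowercase, expand a few common abbreviations, and collapse whitespace."""
    text = label.strip().lower()
    text = re.sub(r"\bthm\b\.?", "theorem", text)
    text = re.sub(r"\blem\b\.?", "lemma", text)
    text = re.sub(r"\bprop\b\.?", "proposition", text)
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".")


def match_labels(predicted: list[str], truth: list[str]) -> set[str]:
    """Return the predicted labels whose normalized form equals some truth label."""
    truth_keys = {normalize_label(t) for t in truth}
    return {p for p in predicted if normalize_label(p) in truth_keys}


# Invented labels for illustration.
matched = match_labels(
    ["Thm. 1.5", "lemma  4.1", "Proposition 2.3"],
    ["Theorem 1.5", "Lemma 4.1"],
)
print(sorted(matched))
```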

4. Experimental Results and Comparison Methodology

The benchmark provides comparative results for two error-finding methodologies:

PF+BV (Pseudo-Formalization + Block Verification): Translates each proof into modular, self-contained blocks, then verifies each block independently using LLMs. A location is marked as erroneous if any verification rollout flags it as faulty.
LLM-as-judge baseline: Direct error-prediction by a single call to a GPT-5.4-mini model, accepting LaTeX and PDF as input.
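
The "any rollout flags it" rule used by PF+BV can be sketched as follows. The verifier here is a toy stub standing in for an LLM call; `flag_blocks`, `stub_verify`, and the block labels are hypothetical names introduced for illustration only.

```python
from typing import Callable, Sequence


def flag_blocks(
    blocks: Sequence[str],
    verify: Callable[[str, int], bool],
    k: int,
) -> list[str]:
    """Return the blocks flagged as faulty by at least one of k rollouts."""
    return [b for b in blocks if any(verify(b, i) for i in range(k))]


def stub_verify(block: str, rollout: int) -> bool:
    # Toy stand-in for an LLM verification call:
    # it flags "Lemma 4.1" only on rollout index 5.
    return block == "Lemma 4.1" and rollout == 5


blocks = ["Lemma 3.2", "Lemma 4.1", "Theorem 1.5"]
flagged_k8 = flag_blocks(blocks, stub_verify, k=8)
flagged_k4 = flag_blocks(blocks, stub_verify, k=4)
print(flagged_k8, flagged_k4)
```

Because a block is flagged as soon as one rollout objects, the flagged set can only grow as the number of rollouts increases: the fault caught at rollout 5 is found with eight rollouts but missed with four.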

Key empirical results at $k=8$ verification rollouts are summarized:

Method	k	Precision	Recall	$F_1$	False-alarms/paper	Coverage (all errors found)
LLM-as-judge	8	18 %	38 %	25 %	2.4	30 %
PF + BlockVerify	8	22 %	52 %	31 %	1.8	45 %
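
As a sanity check, the percentage that follows recall in each row can be compared with the harmonic mean of the reported precision and recall. The sketch assumes the reported figures are rounded and that $F_1$ is computed from aggregate precision and recall; if the benchmark averages $F_1$ per paper instead, small deviations are expected.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported in the table above.
rows = {
    "LLM-as-judge": (0.18, 0.38),
    "PF + BlockVerify": (0.22, 0.52),
}
f1_pct = {name: 100 * f1(p, r) for name, (p, r) in rows.items()}
for name, value in f1_pct.items():
    print(f"{name}: {value:.1f} %")
```

The recomputed values are about 24.4 % and 30.9 %, within one percentage point of the reported 25 % and 31 %.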

PF+BV Pareto-dominates the baseline across the recall-precision tradeoff. Increasing $k$ increases recall but decreases precision. The number of false alarms (predicted locations not matching author labels) is consistently lower for PF+BV at comparable recall.
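
The two paper-level columns can be sketched as below. Reading "Coverage (all errors found)" as the fraction of papers in which every labeled error appears among the predictions is an interpretation of the column header, not a definition given in the source; the function names and the two toy papers are likewise assumptions.

```python
Paper = tuple[set[str], set[str]]  # (predicted locations, labeled locations)


def false_alarms_per_paper(papers: list[Paper]) -> float:
    """Mean number of predicted locations that match no author label."""
    return sum(len(pred - truth) for pred, truth in papers) / len(papers)


def full_coverage_rate(papers: list[Paper]) -> float:
    """Fraction of papers whose labeled errors are all predicted (assumed reading)."""
    return sum(truth <= pred for pred, truth in papers) / len(papers)


# Invented toy data: two papers.
papers = [
    ({"Lemma 4.1", "Lemma 3.2"}, {"Lemma 4.1"}),
    ({"Theorem 2.1"}, {"Theorem 1.5", "Lemma 2.2"}),
]
alarms = false_alarms_per_paper(papers)
coverage = full_coverage_rate(papers)
print(alarms, coverage)
```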

Empirical findings indicate that PF+BV identifies a wide variety of proof failures, such as computational mistakes, incorrect case analyses, and erroneous lemma application. However, errors that span multiple logical blocks or depend on deep, unformalized background are more likely to be missed (Barkallah et al., 19 May 2026).

5. Error Types, Limitations, and Domain-Specific Challenges

Detected errors: Localized misapplications of lemmas, unjustified generalizations, computational oversights, and case analysis failures.
Missed errors: Multi-lemma arguments with errors spanning independently-checked blocks, or dependencies on implicit mathematical background not formalized in the proof structure.
Domain challenges: Specialized subfields (e.g., moduli of sheaves, advanced homotopy theory) with heavy usage of technical notation can impede both translation to the pseudo-formal structure and effective verification.
Label limitations: Absence of negative labels implies that false-positive rates are only upper bounds; unannotated locations may in fact be erroneous.

This suggests that evaluation results must be interpreted as lower bounds on precision and coverage, and some structurally complex errors will require further methodological innovation.

6. Usage, Extensions, and Future Directions

ArxivMathGradingBench is distributed as a HuggingFace dataset, with provided schemas (JSON for inputs, XML for error lists) and complete pre- and post-correction paper sources. This structure supports reproducibility, extensibility, and integration into broader evaluation protocols.

Identified avenues for future work include:

Training models for native emission of pseudo-formal proof structure, reducing translation errors.
Augmenting the benchmark with human-audited false alarm analyses and additional error types.
Extending coverage to adjacent fields (e.g., theoretical computer science, algorithmic arguments).
Developing hybrid autoformalization pipelines that draw on existing proof assistants (e.g., Lean).
Introducing finer calibration of error detection thresholds ("strict" vs. "lenient" modes) for tailored downstream applications.

A plausible implication is that, by standardizing evaluation on real-world, author-validated proof errors, ArxivMathGradingBench provides a scalable pathway for the principled assessment and training of advanced proof-verification systems in mathematical research contexts (Barkallah et al., 19 May 2026).


References (1)

Barkallah et al. (2026). Pseudo-Formalization for Automatic Proof Verification.
