IMO-ProofBench: Benchmarking AI Mathematical Proofs
- IMO-ProofBench is a benchmark for evaluating AI’s ability to construct rigorous, multi-step proofs on challenging Olympiad-level mathematics problems.
- It comprises 60 curated problems (30 basic, 30 advanced) spanning algebra, combinatorics, geometry, and number theory, with problems constructed to resist overfitting and memorization.
- The grading system integrates detailed human evaluation and automated scoring via ProofAutoGrader, ensuring alignment with expert mathematical reasoning standards.
IMO-ProofBench is a mathematical reasoning benchmark explicitly constructed to evaluate the proof-writing and stepwise deductive capabilities of AI models on International Mathematical Olympiad (IMO)–level problems. Designed and vetted by specialists, it consists of both basic and advanced Olympiad proof problems, coupled with a robust grading rubric and automated assessment tools. IMO-ProofBench addresses core limitations of prior benchmarks by demanding multi-step, rigorous proof construction rather than mere answer extraction, setting a high bar for evaluating foundation models and AI systems intended for advanced mathematics.
1. Structure and Composition of IMO-ProofBench
IMO-ProofBench comprises a curated set of 60 proof-based Olympiad problems, divided equally into Basic and Advanced subsets:
- Basic Subset (30 problems): Encompasses problems ranging from pre-IMO difficulty to IMO-medium, often rephrasings of known Olympiad questions and intended to be accessible to current LLMs.
- Advanced Subset (30 problems): Composed of difficult, novel, or modified IMO/USAMO-level problems, including 18 new problems written by medalists and 12 “robustified” recent real-competition questions. These advanced problems are crafted to avoid overfitting and memorization, ensuring that models are evaluated on reasoning and not on pattern recognition or retrieval.
Each problem is presented as an open-ended proof task (not as multiple choice or answer-only), covering primary Olympiad domains—algebra, combinatorics, geometry, and number theory. Problems are sometimes further “robustified” by modifications (e.g., changes in constants, structure, or context) to ensure that shallow solution strategies cannot succeed.
2. Grading Rubric and Human Evaluation Protocols
Proofs are evaluated using a high-fidelity scoring rubric inspired by the IMO's standard, assigning scores on a [0, 7] integer scale with named categories anchoring the key levels:
| Category | Points | Solution Quality |
|---|---|---|
| Correct | 7 | Fully correct, rigorous, and complete |
| Almost | 6 | Almost correct, minor errors |
| Partial | 1 | Mostly incorrect, some relevant correct work |
| Incorrect | 0 | Completely incorrect or irrelevant |
Human graders can assign any integer score from 0 to 7 for finer discrimination. Each grader references detailed marking schemes, including criteria for edge cases and partial progress, adapted from official IMO guidelines. Correctness is judged not only on the final answer but also on logical validity, completeness of argument, treatment of edge cases, and justification of every step. This protocol ensures that both missing generality and unjustified steps are strictly penalized.
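For downstream analysis it is convenient to treat the rubric as a small scoring schema. The sketch below is a minimal Python illustration: the anchor categories are taken from the table above, while the `RUBRIC_ANCHORS` name, the `bucket_score` helper, and the treatment of intermediate scores 1–5 as "Partial" are assumptions made here for illustration, not part of the published rubric.

```python
# Minimal sketch of the 0-7 rubric as a lookup table plus a bucketing helper.
# The four anchor categories and their points come from the table above; the
# bucketing of intermediate scores is an assumption made for illustration.

RUBRIC_ANCHORS = {
    "Correct": 7,    # fully correct, rigorous, and complete
    "Almost": 6,     # almost correct, minor errors
    "Partial": 1,    # mostly incorrect, some relevant correct work
    "Incorrect": 0,  # completely incorrect or irrelevant
}

def bucket_score(score: int) -> str:
    """Map an integer score in [0, 7] to a named rubric category."""
    if not 0 <= score <= 7:
        raise ValueError("rubric scores are integers in [0, 7]")
    if score == 7:
        return "Correct"
    if score == 6:
        return "Almost"
    if score >= 1:
        return "Partial"   # assumed bucketing for scores 1-5
    return "Incorrect"
```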
3. Automatic Proof Grading: ProofAutoGrader and IMO-GradingBench
Given the scale and complexity of evaluating informal, stepwise mathematical proofs, an automated grading pipeline (“ProofAutoGrader”) is introduced. The pipeline uses Gemini 2.5 Pro as its backbone, with the following operational design (see the sketch after this list):
- Inputs: candidate proof, problem statement, a reference solution, and detailed grading guidelines.
- Procedure: The model acts as a strict evaluator, outputting a score on the standard rubric.
- Empirical validation: ProofAutoGrader’s scores yield high Pearson correlations with human experts (0.93–0.96), and its confusion matrix shows that almost all disagreements are between “Partial” and “Incorrect,” not “Correct” and “Incorrect.”
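In outline, the pipeline assembles these inputs into a single grading prompt and extracts a rubric score from the model's reply. The following is a hedged sketch of that flow, assuming a generic `call_llm` callable as a stand-in for the backbone model; the prompt wording, the `SCORE:` output convention, and the parsing logic are hypothetical illustrations, not the published implementation or the Gemini API.

```python
import re

def grade_proof(problem: str, reference_solution: str, candidate_proof: str,
                guidelines: str, call_llm) -> int:
    """Sketch of an autograder call: build a strict-evaluator prompt and
    parse an integer score on the 0-7 rubric from the model's reply.
    `call_llm` is a hypothetical callable wrapping the backbone model."""
    prompt = (
        "You are a strict IMO grader. Score the candidate proof on a 0-7 scale.\n\n"
        f"Grading guidelines:\n{guidelines}\n\n"
        f"Problem statement:\n{problem}\n\n"
        f"Reference solution:\n{reference_solution}\n\n"
        f"Candidate proof:\n{candidate_proof}\n\n"
        "Finish your reply with a final line of the form 'SCORE: <integer 0-7>'."
    )
    reply = call_llm(prompt)
    match = re.search(r"SCORE:\s*([0-7])\b", reply)
    if match is None:
        raise ValueError("no rubric score found in grader output")
    return int(match.group(1))
```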
Additionally, the IMO-GradingBench supplement provides 1,000 human-graded proof solutions on Advanced IMO-ProofBench problems, supporting meta-evaluation of models’ grading capabilities. The best models achieve a mean absolute error (MAE) of 18.4% in matching human graders, indicating reasonably close alignment, especially when problem context and grading guidance are provided.
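Agreement statistics of this kind can be computed from paired human and automated scores with standard tooling. The snippet below is a minimal sketch, assuming the two sets of scores are available as parallel arrays on the 0–7 scale and that the reported MAE percentage is normalized by the 7-point maximum (an assumption; the summary above does not spell out the normalization).

```python
import numpy as np
from scipy.stats import pearsonr

def grading_agreement(human_scores, auto_scores):
    """Pearson correlation and normalized MAE between paired human and
    automated rubric scores, both on the 0-7 scale."""
    human = np.asarray(human_scores, dtype=float)
    auto = np.asarray(auto_scores, dtype=float)
    r, _p_value = pearsonr(human, auto)                        # reported range: 0.93-0.96
    mae_pct = float(np.mean(np.abs(human - auto))) / 7 * 100   # assumed normalization by 7 points
    return r, mae_pct

# Toy usage (illustrative scores only, not benchmark data):
# r, mae = grading_agreement([7, 6, 1, 0, 7], [7, 5, 0, 1, 7])
```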
4. Model Performance and Empirical Findings
IMO-ProofBench exposes a significant gap between current state-of-the-art models and the demands of robust mathematical proof-writing. Key quantitative highlights:
- Gemini Deep Think (Gold): Achieves 89% on Basic and 65.7% on Advanced ProofBench (human-graded).
- Next-best (GPT-5): 59% on Basic, 20% on Advanced.
- Best open-weight model: 7.1% on Advanced.
- Other models (e.g., Claude Opus 4, DeepSeek V3, Qwen3-235B): Generally below 5% on Advanced.
The majority of models fail to provide complete, general, and rigorously justified proofs on the Advanced track; common errors include guessing final answers, considering only special cases, omitting key steps, and failing to abstract sufficiently.
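The headline percentages are consistent with averaging per-problem rubric scores and normalizing by the 7-point maximum; the sketch below illustrates that aggregation under this assumption (the exact aggregation used for the reported numbers is not restated here).

```python
def benchmark_percentage(problem_scores):
    """Aggregate per-problem rubric scores (each 0-7) into a percentage of the
    maximum attainable score, assuming simple normalization by 7 points."""
    if not problem_scores:
        raise ValueError("no scores provided")
    return 100 * sum(problem_scores) / (7 * len(problem_scores))

# Toy example: full marks on half the problems and zero on the rest gives 50%.
```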
5. Robustification, Problem Diversity, and Implications for Generalization
Robustification of advanced problems is central: modifications to problem statements defeat answer-matching and force reasoning on new ground. The selection spans the full spectrum of Olympiad topics, including combinatorics, inequalities, functional equations, number theory, and synthetic geometry, ensuring that superficial mastery of patterns cannot substitute for mathematical insight. This diversity strongly penalizes memorization and shallow template-matching.
The suite is particularly effective at revealing "proof gaps": cases where models produce plausible but incomplete or non-general proofs that might pass binary answer-based checks yet fail under rigorous grading. This is demonstrated by the large performance gap observed between answer-based and proof-based assessments of the same models, and by the prevalence of "Partial" scores in human-graded runs.
6. Benchmark Impact and Future Research Directions
IMO-ProofBench represents a significant advancement for benchmarking in mathematical AI:
- Rigor and Reasoning Focus: Shifts evaluation from answer-getting to deep chain-of-reasoning, robustly distinguishing between models that can guess correct values and those capable of constructing convincing mathematical arguments.
- Grading Infrastructure: Provides validated tools and large-scale human evaluation traces, supporting scalable, reproducible, and robust benchmarking.
- Proof Evaluation as a Research Area: With IMO-GradingBench and high-correlation auto-grading, it creates a closed loop for both evaluating and training grading models, essential for reward modeling and reinforcement learning on proofs.
- Empirical Clarity: Clearly demonstrates that even models excelling at short-answer tasks (e.g., 80%+ on answer-only IMO-Bench) are currently far from mastery at full proof tasks on Olympiad-level mathematics.
- Opens New Challenges: Future advances will likely require innovations in proof planning, robust self-verification, auto-formalization, and integration of neural and symbolic reasoning.
7. Example Problem Formats and Grading Samples
Problems are presented with explicit, technical, and nontrivial mathematical content. For example:
> For a real number $x$, let $\{x\}$ denote the fractional part of $x$ in its decimal representation. For a real number $\alpha$ and a positive integer $n$, define $S_n(\alpha)$ as $S_n(\alpha) = \{\alpha\} + \{2\alpha\} + \cdots + \{n\alpha\}$. Find all positive real numbers $\alpha$ such that $S_n(\alpha)$ is a multiple of $n$ for all positive integers $n$.
Proofs are awarded points per the standardized rubric given above, with human and automated graders agreeing closely except in the partial-vs-incorrect categories. The design explicitly supports scalable, robust, and deterministic evaluation, measuring not just answer correctness but the ability to sustain logical validity across potentially long and creative chain-of-thought solutions.
IMO-ProofBench thus sets the current bar for evaluating the mathematical reasoning and proof-writing capacities of AI, with carefully chosen, challenging Olympiad-level content, rigorous multi-level grading aligned with expert practice, and infrastructure for both human and automated assessment. It reveals the stark difference between answer-level and proof-level mathematical ability, and sets a clear agenda for future progress in foundation model mathematics (Luong et al., 3 Nov 2025).