IMO-Bench: Robust Math Reasoning Benchmark
- IMO-Bench is a high-difficulty mathematical reasoning benchmark suite offering rigorous evaluation of AI models on curated Olympiad problems.
- It employs robust problem design, automated answer and proof grading systems, and diverse stratification across algebra, combinatorics, geometry, and number theory.
- The benchmark drives advances in AI reasoning by fostering genuine proof synthesis, deterring memorization, and ensuring scalable, reproducible assessments.
IMO-Bench is a high-difficulty mathematical reasoning benchmark suite designed to rigorously evaluate the capabilities of AI models, particularly LLMs, on advanced problems at the level of the International Mathematical Olympiad (IMO). It provides expert-vetted, diverse, and robustified Olympiad problems with standardized, automated, and human-aligned grading tools, enabling comprehensive assessment of both answer correctness and proof-writing abilities. Released to support progress in robust mathematical reasoning, IMO-Bench addresses core limitations of prior evaluation suites by prioritizing genuine reasoning, deterring memorization, and enabling scalable, reproducible measurement of advanced mathematical cognition (Luong et al., 3 Nov 2025).
1. Benchmark Construction and Problem Design
IMO-Bench consists of real and robustified Olympiad problems, sourced and curated by panels of IMO medalists and mathematical specialists to preclude memorization, answer-key overfitting, or reliance on surface-level cues. Each problem is modified—through paraphrasing, renaming, parameter changes, and targeted distractors—to ensure evaluation reflects genuine mathematical reasoning rather than pattern-matching. The suite stratifies problems by domain and difficulty:
- Domains: Algebra, Combinatorics, Geometry, Number Theory (100 problems each in AnswerBench)
- Difficulty Levels:
- Pre-IMO (easier, sub-Olympiad)
- IMO-Easy (P1 or P4)
- IMO-Medium (P2 or P5)
- IMO-Hard (P3, P6, or constructed to exceed typical Olympiad difficulty)
For proof-based tasks, robustification includes not only problem restatement but selection and synthesis of proofs to ensure diversity and resistance to overfitting.
2. Structure and Core Components
The benchmark suite is organized into three principal tracks:
| Suite Component | Type | Content (Scope) | Evaluation Mode |
|---|---|---|---|
| IMO-AnswerBench | Short Answer | 400 problems, 100 per domain, stratified by difficulty | Answer autograder |
| IMO-ProofBench | Full Proof | 60 proof tasks (30 basic, 30 advanced) including original and robustified IMO sets | Proof autograder + human grading |
| IMO-GradingBench | Grading Benchmark | 1000 human-graded proofs drawn from the advanced ProofBench set | Automated proof grading evaluation |
AnswerBench requires a definitive, verifiable short answer; ProofBench demands a complete, rigorous proof and supports nuanced grading; GradingBench focuses on evaluating the alignment of automated grading algorithms with human expert assessment.
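To make the per-track data concrete, the sketch below shows a minimal, hypothetical Python schema for benchmark records. The field names and layout are illustrative assumptions, not the official IMO-Bench dataset format.

```python
from dataclasses import dataclass

# Hypothetical record layouts for the three tracks. Field names are
# illustrative assumptions, not the official IMO-Bench data format.

@dataclass
class AnswerBenchItem:
    problem_id: str
    domain: str            # "algebra" | "combinatorics" | "geometry" | "number_theory"
    difficulty: str        # "pre-imo" | "imo-easy" | "imo-medium" | "imo-hard"
    statement: str         # robustified problem statement
    reference_answer: str  # short, verifiable final answer

@dataclass
class ProofBenchItem:
    problem_id: str
    statement: str
    reference_solution: str  # expert reference proof consumed by the autograder

@dataclass
class GradingBenchItem:
    problem_id: str
    candidate_proof: str     # model-generated proof to be graded
    human_grade: int         # expert score on the 0-7 rubric scale
```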
3. Grading Frameworks and Automation Tools
To enable large-scale, standardized evaluation, IMO-Bench provides highly structured autograding systems utilizing powerful LLMs (notably Gemini 2.5 Pro):
- AnswerAutoGrader: Extracts final answers in varied mathematical forms from model responses and applies equivalence normalization (e.g., set notation, algebraic rearrangement) to issue binary correct/incorrect judgments; it demonstrated near-perfect accuracy (>98%) against human expert judgments (a minimal equivalence-check sketch follows this list).
- ProofAutoGrader: Consumes problem, standard reference solution, candidate proof, and a detailed 0–7 scoring rubric. It produces granular grades with justification, with Pearson correlation 0.93–0.96 against expert panelists.
- GradingBench Evaluation: Serves as a testbed for automated grading algorithms, reporting mean absolute error (MAE) and categorical accuracy relative to human-assigned proof scores.
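To illustrate the equivalence-normalization step of the AnswerAutoGrader, here is a minimal sketch using SymPy. It handles only simple symbolic or numeric answers; the actual grader relies on an LLM to extract and compare far richer forms (sets, tuples, case descriptions).

```python
import sympy as sp

def answers_equivalent(candidate: str, reference: str) -> bool:
    """Minimal sketch of answer-equivalence checking: parse both strings
    symbolically and test whether their difference simplifies to zero.
    The real AnswerAutoGrader uses an LLM to normalize far more general
    answer forms than this covers."""
    try:
        cand = sp.sympify(candidate)
        ref = sp.sympify(reference)
        return sp.simplify(cand - ref) == 0
    except (sp.SympifyError, TypeError):
        # Fall back to exact string comparison when parsing fails.
        return candidate.strip() == reference.strip()

# Example: algebraically equivalent answers are accepted.
assert answers_equivalent("2*(n + 1)", "2*n + 2")
assert not answers_equivalent("n**2", "2*n")
```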
The proof grading rubric is four-tiered (0, 1, 6, and 7 points, corresponding to incorrect, partial progress, almost correct, and fully correct), with provisions for intermediate values to reflect nuanced progress or rigor gaps.
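Building on this rubric, the sketch below shows one plausible way such a proof autograder could be wired up. The `call_llm` function stands in for whatever LLM backend (e.g., Gemini 2.5 Pro) is used, and the prompt wording, rubric text, and score parsing are illustrative assumptions rather than the published ProofAutoGrader implementation.

```python
import re

# Illustrative rubric text; the actual IMO-Bench rubric is more detailed.
RUBRIC = """Score the candidate proof on a 0-7 scale:
7 = complete and rigorous; 6 = essentially correct with minor gaps;
1 = meaningful partial progress; 0 = no substantial progress.
Intermediate values may be used for nuanced cases."""

def grade_proof(problem: str, reference_solution: str, candidate_proof: str,
                call_llm) -> tuple[int, str]:
    """Sketch of a rubric-based proof autograder. `call_llm` is a placeholder
    for an LLM completion function (prompt -> text); the prompt and parsing
    below are assumptions, not the published ProofAutoGrader."""
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Reference solution:\n{reference_solution}\n\n"
        f"Candidate proof:\n{candidate_proof}\n\n"
        f"Rubric:\n{RUBRIC}\n\n"
        "Return a justification followed by a final line 'SCORE: <0-7>'."
    )
    response = call_llm(prompt)
    match = re.search(r"SCORE:\s*(\d)", response)
    score = int(match.group(1)) if match else 0
    return score, response
```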
4. Empirical Results and Model Performance
The benchmark provides a clear, quantitative comparison landscape for state-of-the-art and open-weight models. On IMO-AnswerBench (400 questions):
| Model | Accuracy (%) |
|---|---|
| Gemini Deep Think | 80.0 |
| Best non-Gemini (Grok 4) | 73.1 |
| GPT-5 | 65.6 |
| Best open-weight (DeepSeek R1) | 60.8 |
| Kimi-K2-Instruct | 45.8 |
On the advanced IMO-ProofBench set (30 proof tasks, human graded):
| Model | Proof Score (%) |
|---|---|
| Gemini Deep Think | 65.7 |
| Best non-Gemini | 23.3 |
| o3 | 20.5 |
| GPT-5 | 20.0 |
| Gemini 2.5 Pro | 17.6 |
On the proof grading task (GradingBench, categorical accuracy):
- The best LLM grader achieves only 54.0% categorical accuracy and an MAE of 18.4%, compared with a human inter-rater MAE near 3.9%.
These results indicate that only Gemini Deep Think attains gold-level performance on robustified Olympiad tasks, with a large margin over other models, and that automated proof grading remains a substantial open gap.
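As a concrete reading of the GradingBench metrics above, the sketch below computes mean absolute error and categorical accuracy of automated grades against human scores on the 0-7 scale. Reporting MAE as a percentage of the 7-point scale and treating categorical accuracy as exact score agreement are assumptions here, not confirmed details of the paper's evaluation.

```python
def grading_metrics(auto_scores: list[int], human_scores: list[int]) -> dict:
    """Sketch of GradingBench-style metrics: MAE (reported here as a
    percentage of the 7-point scale, an assumed normalization) and
    categorical accuracy (interpreted as exact score agreement)."""
    assert auto_scores and len(auto_scores) == len(human_scores)
    n = len(auto_scores)
    mae = sum(abs(a - h) for a, h in zip(auto_scores, human_scores)) / n
    accuracy = sum(a == h for a, h in zip(auto_scores, human_scores)) / n
    return {"mae_percent": 100 * mae / 7, "categorical_accuracy": 100 * accuracy}

# Example: two of three grades match exactly; one is off by one point.
print(grading_metrics([7, 1, 6], [7, 1, 7]))
# -> {'mae_percent': 4.76..., 'categorical_accuracy': 66.66...}
```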
5. Methodological Advances and Robustification Strategy
IMO-Bench introduces several innovations to preserve rigor and avoid shortcut learning:
- Robustification: Problems are paraphrased, variables renamed, distractors added, and question formats shifted (e.g., from “describe all” to “compute the sum”), as in one worked example where an original constraint problem is recast as a triangle side-sum computation (a schematic sketch of such transformations follows this list).
- Automated evaluation: Reliance on standardized, highly reliable LLM-driven autograders decouples performance measurement from slow, potentially inconsistent manual assessment, enabling broad, reproducible experiments and rapid model iteration.
- Proof rigor: Focus on multi-step, verifiable reasoning ("how and why" not just "what"), with stepwise assessment and detailed rubrics to distinguish partial progress, conceptual error, and surface-level correctness.
- Contamination resistance: Robustification and problem design approaches are maintained to mitigate effects of model pretraining leakage or overfitting on canonical question forms.
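To make the robustification idea more tangible, here is a purely illustrative sketch of the mechanical parts of such transformations (variable renaming, parameter perturbation, question-format shift). In IMO-Bench the robustification is performed and vetted by expert curators and also involves paraphrasing and targeted distractors, so this code is a toy sketch, not the actual pipeline.

```python
import random

def robustify(problem: dict, seed: int = 0) -> dict:
    """Illustrative sketch of mechanical robustification steps: rename
    variables, perturb a numeric parameter, and shift the question format.
    Real IMO-Bench robustification is expert-curated and far richer."""
    rng = random.Random(seed)
    statement = problem["statement"]

    # 1. Rename variables/labels to break surface-level memorization cues.
    for old, new in problem.get("rename_map", {}).items():
        statement = statement.replace(old, new)

    # 2. Perturb a numeric parameter inside a validity-preserving range,
    #    assuming the statement uses a template slot such as "{N}".
    if "parameter_range" in problem:
        lo, hi = problem["parameter_range"]
        statement = statement.replace("{N}", str(rng.randint(lo, hi)))

    # 3. Shift the question format, e.g. "Determine all" -> a computed quantity.
    statement = statement.replace("Determine all", "Compute the sum of all")

    return {**problem, "statement": statement}

# Hypothetical usage on a toy template (not an actual IMO-Bench problem).
toy = {"statement": "Determine all integers x with x^2 < {N}.",
       "rename_map": {"x": "m"}, "parameter_range": (10, 99)}
print(robustify(toy)["statement"])
```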
6. Scientific and Methodological Significance
IMO-Bench establishes a rigorous, reproducible, and challenging standard for robust mathematical reasoning in AI:
- Difficulty: The suite exposes substantial headroom relative to established mathematical benchmarks (e.g., GSM8K, MATH, AIME), which many models now saturate.
- Proof focus: The shift from answer-only evaluation to proof-writing and grading benchmarks models' capacity for stepwise abstraction and justification, central to mathematical practice.
- Generalization: High-performing models must not only recall or pattern-match but synthesize valid reasoning chains and adapt to problem modifications.
- Automated grading: Empirically aligned autograders enable scalable evaluation, but also highlight the remaining deficits in fine-grained proof understanding.
- Research utility: By stratifying problems and providing structured datasets (problem, answer, proof, human grade), IMO-Bench supports research in autoformalization, proof synthesis, toolchain building, and explainable mathematical AI.
7. Implications and Future Directions
The publication of IMO-Bench marks a significant redirection of AI mathematical reasoning research:
- The suite is positioned as a new standard for benchmarking, exceeding GSM8K, MATH, and similar datasets in both breadth and depth.
- Automated grading tools, with their close human alignment, accelerate model evaluation cycles and enable reproducible experiments at scale.
- Mixed results on the proof and grading tracks signal focus areas for architectural and training advances, especially reliable handling of advanced combinatorial and geometric arguments, formal proof extraction, and minimizing rigor gaps in free-form proofs.
- Ongoing robustification and potential adversarial problem generation are necessary to sustain the relevance and contamination-resistance of the benchmark as LLM capabilities advance.
IMO-Bench is freely available to the research community at https://imobench.github.io/, forming the basis for trustworthy and forward-looking mathematical reasoning research (Luong et al., 3 Nov 2025).