IMO-Bench: Robust Math Reasoning Benchmark
- IMO-Bench is a high-difficulty mathematical reasoning benchmark suite offering rigorous evaluation of AI models on curated Olympiad problems.
- It employs robust problem design, automated answer and proof grading systems, and diverse stratification across algebra, combinatorics, geometry, and number theory.
- The benchmark drives advances in AI reasoning by fostering genuine proof synthesis, deterring memorization, and ensuring scalable, reproducible assessments.
IMO-Bench is a high-difficulty mathematical reasoning benchmark suite designed to rigorously evaluate the capabilities of AI models, particularly LLMs, on advanced problems at the level of the International Mathematical Olympiad (IMO). It provides expert-vetted, diverse, and robustified Olympiad problems with standardized, automated, and human-aligned grading tools, enabling comprehensive assessment of both answer correctness and proof-writing abilities. Released to support progress in robust mathematical reasoning, IMO-Bench addresses core limitations of prior evaluation suites by prioritizing genuine reasoning, deterring memorization, and enabling scalable, reproducible measurement of advanced mathematical cognition (Luong et al., 3 Nov 2025).
1. Benchmark Construction and Problem Design
IMO-Bench consists of real and robustified Olympiad problems, sourced and curated by panels of IMO medalists and mathematical specialists to preclude memorization, answer-key overfitting, or reliance on surface-level cues. Each problem is modified—through paraphrasing, renaming, parameter changes, and targeted distractors—to ensure evaluation reflects genuine mathematical reasoning rather than pattern-matching. The suite stratifies problems by domain and difficulty:
- Domains: Algebra, Combinatorics, Geometry, Number Theory (100 problems each in AnswerBench)
- Difficulty Levels:
- Pre-IMO (easier, sub-Olympiad)
- IMO-Easy (P1 or P4)
- IMO-Medium (P2 or P5)
- IMO-Hard (P3, P6, or constructed to exceed typical Olympiad difficulty)
For proof-based tasks, robustification includes not only problem restatement but selection and synthesis of proofs to ensure diversity and resistance to overfitting.
2. Structure and Core Components
The benchmark suite is organized into three principal tracks:
| Suite Component | Type | Content (Scope) | Evaluation Mode |
|---|---|---|---|
| IMO-AnswerBench | Short Answer | 400 problems, 100 per domain, stratified by difficulty | Answer autograder |
| IMO-ProofBench | Full Proof | 60 proof tasks (30 basic, 30 advanced) including original and robustified IMO sets | Proof autograder + human grading |
| IMO-GradingBench | Grading Benchmark | 1000 human-graded proofs drawn from the advanced ProofBench set | Automated proof grading evaluation |
AnswerBench requires a definitive, verifiable short answer; ProofBench demands a complete, rigorous proof and supports nuanced grading; GradingBench focuses on evaluating the alignment of automated grading algorithms with human expert assessment.
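To make the per-track data concrete, the sketch below shows a minimal, hypothetical Python schema for benchmark records. The field names and layout are illustrative assumptions, not the official IMO-Bench dataset format.

```python
from dataclasses import dataclass

# Hypothetical record layouts for the three tracks. Field names are
# illustrative assumptions, not the official IMO-Bench data format.

@dataclass
class AnswerBenchItem:
    problem_id: str
    domain: str            # "algebra" | "combinatorics" | "geometry" | "number_theory"
    difficulty: str        # "pre-imo" | "imo-easy" | "imo-medium" | "imo-hard"
    statement: str         # robustified problem statement
    reference_answer: str  # short, verifiable final answer

@dataclass
class ProofBenchItem:
    problem_id: str
    statement: str
    reference_solution: str  # expert reference proof consumed by the autograder

@dataclass
class GradingBenchItem:
    problem_id: str
    candidate_proof: str     # model-generated proof to be graded
    human_grade: int         # expert score on the 0-7 rubric scale
```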
3. Grading Frameworks and Automation Tools
To enable large-scale, standardized evaluation, IMO-Bench provides highly structured autograding systems utilizing powerful LLMs (notably Gemini 2.5 Pro):
- AnswerAutoGrader: Extracts final answers in varied mathematical forms from model responses and applies equivalence normalization (e.g., set notation, algebraic rearrangement) to issue binary correct/incorrect judgments; it demonstrated near-perfect accuracy (>98%) against human expert judgments (a minimal equivalence-check sketch follows this list).
- ProofAutoGrader: Consumes problem, standard reference solution, candidate proof, and a detailed 0–7 scoring rubric. It produces granular grades with justification, with Pearson correlation 0.93–0.96 against expert panelists.
- GradingBench Evaluation: Serves as a testbed for automated grading algorithms, reporting mean absolute error (MAE) and categorical accuracy relative to human-assigned proof scores.
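To illustrate the equivalence-normalization step of the AnswerAutoGrader, here is a minimal sketch using SymPy. It handles only simple symbolic or numeric answers; the actual grader relies on an LLM to extract and compare far richer forms (sets, tuples, case descriptions).

```python
import sympy as sp

def answers_equivalent(candidate: str, reference: str) -> bool:
    """Minimal sketch of answer-equivalence checking: parse both strings
    symbolically and test whether their difference simplifies to zero.
    The real AnswerAutoGrader uses an LLM to normalize far more general
    answer forms than this covers."""
    try:
        cand = sp.sympify(candidate)
        ref = sp.sympify(reference)
        return sp.simplify(cand - ref) == 0
    except (sp.SympifyError, TypeError):
        # Fall back to exact string comparison when parsing fails.
        return candidate.strip() == reference.strip()

# Example: algebraically equivalent answers are accepted.
assert answers_equivalent("2*(n + 1)", "2*n + 2")
assert not answers_equivalent("n**2", "2*n")
```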
The proof grading rubric is four-tiered (0, 1, 6, and 7 points, corresponding to incorrect, partial progress, almost correct, and fully correct), with provisions for intermediate values to reflect nuanced progress or rigor gaps.
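Building on this rubric, the sketch below shows one plausible way such a proof autograder could be wired up. The `call_llm` function stands in for whatever LLM backend (e.g., Gemini 2.5 Pro) is used, and the prompt wording, rubric text, and score parsing are illustrative assumptions rather than the published ProofAutoGrader implementation.

```python
import re

# Illustrative rubric text; the actual IMO-Bench rubric is more detailed.
RUBRIC = """Score the candidate proof on a 0-7 scale:
7 = complete and rigorous; 6 = essentially correct with minor gaps;
1 = meaningful partial progress; 0 = no substantial progress.
Intermediate values may be used for nuanced cases."""

def grade_proof(problem: str, reference_solution: str, candidate_proof: str,
                call_llm) -> tuple[int, str]:
    """Sketch of a rubric-based proof autograder. `call_llm` is a placeholder
    for an LLM completion function (prompt -> text); the prompt and parsing
    below are assumptions, not the published ProofAutoGrader."""
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Reference solution:\n{reference_solution}\n\n"
        f"Candidate proof:\n{candidate_proof}\n\n"
        f"Rubric:\n{RUBRIC}\n\n"
        "Return a justification followed by a final line 'SCORE: <0-7>'."
    )
    response = call_llm(prompt)
    match = re.search(r"SCORE:\s*(\d)", response)
    score = int(match.group(1)) if match else 0
    return score, response
```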
4. Empirical Results and Model Performance
The benchmark provides a clear, quantitative comparison landscape for state-of-the-art and open-weight models. On IMO-AnswerBench (400 questions):
| Model | Accuracy (%) |
|---|---|
| Gemini Deep Think | 80.0 |
| Best non-Gemini (Grok 4) | 73.1 |
| GPT-5 | 65.6 |
| Best open-weight (DeepSeek R1) | 60.8 |
| Kimi-K2-Instruct | 45.8 |
On the advanced IMO-ProofBench set (30 proof tasks, human graded):
| Model | Proof Score (%) |
|---|---|
| Gemini Deep Think | 65.7 |
| Best non-Gemini | 23.3 |
| o3 | 20.5 |
| GPT-5 | 20.0 |
| Gemini 2.5 Pro | 17.6 |
On the proof grading task (GradingBench, categorical accuracy):
- The best LLM grader achieves only 54.0% categorical accuracy and an MAE of 18.4%, compared with a human inter-rater MAE near 3.9%.
These results indicate that only Gemini Deep Think attains gold-level performance on robustified Olympiad tasks, with a large margin over other models, and that automated proof grading remains a substantial open gap.
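As a concrete reading of the GradingBench metrics above, the sketch below computes mean absolute error and categorical accuracy of automated grades against human scores on the 0-7 scale. Reporting MAE as a percentage of the 7-point scale and treating categorical accuracy as exact score agreement are assumptions here, not confirmed details of the paper's evaluation.

```python
def grading_metrics(auto_scores: list[int], human_scores: list[int]) -> dict:
    """Sketch of GradingBench-style metrics: MAE (reported here as a
    percentage of the 7-point scale, an assumed normalization) and
    categorical accuracy (interpreted as exact score agreement)."""
    assert auto_scores and len(auto_scores) == len(human_scores)
    n = len(auto_scores)
    mae = sum(abs(a - h) for a, h in zip(auto_scores, human_scores)) / n
    accuracy = sum(a == h for a, h in zip(auto_scores, human_scores)) / n
    return {"mae_percent": 100 * mae / 7, "categorical_accuracy": 100 * accuracy}

# Example: two of three grades match exactly; one is off by one point.
print(grading_metrics([7, 1, 6], [7, 1, 7]))
# -> {'mae_percent': 4.76..., 'categorical_accuracy': 66.66...}
```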
5. Methodological Advances and Robustification Strategy
IMO-Bench introduces several innovations to preserve rigor and avoid shortcut learning:
- Robustification: Problems are paraphrased, variables renamed, distractors added, and question formats shifted (e.g., from “describe all” to “compute the sum”), as in one worked example where an original constraint problem is recast as a triangle side-sum computation (a schematic sketch of such transformations follows this list).
- Automated evaluation: Reliance on standardized, highly reliable LLM-driven autograders decouples performance measurement from slow, potentially inconsistent manual assessment, enabling broad, reproducible experiments and rapid model iteration.
- Proof rigor: Focus on multi-step, verifiable reasoning ("how and why" not just "what"), with stepwise assessment and detailed rubrics to distinguish partial progress, conceptual error, and surface-level correctness.
- Contamination resistance: Robustification and problem design approaches are maintained to mitigate effects of model pretraining leakage or overfitting on canonical question forms.
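To make the robustification idea more tangible, here is a purely illustrative sketch of the mechanical parts of such transformations (variable renaming, parameter perturbation, question-format shift). In IMO-Bench the robustification is performed and vetted by expert curators and also involves paraphrasing and targeted distractors, so this code is a toy sketch, not the actual pipeline.

```python
import random

def robustify(problem: dict, seed: int = 0) -> dict:
    """Illustrative sketch of mechanical robustification steps: rename
    variables, perturb a numeric parameter, and shift the question format.
    Real IMO-Bench robustification is expert-curated and far richer."""
    rng = random.Random(seed)
    statement = problem["statement"]

    # 1. Rename variables/labels to break surface-level memorization cues.
    for old, new in problem.get("rename_map", {}).items():
        statement = statement.replace(old, new)

    # 2. Perturb a numeric parameter inside a validity-preserving range,
    #    assuming the statement uses a template slot such as "{N}".
    if "parameter_range" in problem:
        lo, hi = problem["parameter_range"]
        statement = statement.replace("{N}", str(rng.randint(lo, hi)))

    # 3. Shift the question format, e.g. "Determine all" -> a computed quantity.
    statement = statement.replace("Determine all", "Compute the sum of all")

    return {**problem, "statement": statement}

# Hypothetical usage on a toy template (not an actual IMO-Bench problem).
toy = {"statement": "Determine all integers x with x^2 < {N}.",
       "rename_map": {"x": "m"}, "parameter_range": (10, 99)}
print(robustify(toy)["statement"])
```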
6. Scientific and Methodological Significance
IMO-Bench establishes a rigorous, reproducible, and challenging standard for robust mathematical reasoning in AI:
- Difficulty: The suite exposes substantial headroom relative to established mathematical benchmarks (e.g., GSM8K, MATH, AIME), which many models now saturate.
- Proof focus: The shift from answer-only evaluation to proof-writing and grading benchmarks models' capacity for stepwise abstraction and justification, central to mathematical practice.
- Generalization: High-performing models must not only recall or pattern-match but synthesize valid reasoning chains and adapt to problem modifications.
- Automated grading: Empirically aligned autograders enable scalable evaluation, but also highlight the remaining deficits in fine-grained proof understanding.
- Research utility: By stratifying problems and providing structured datasets (problem, answer, proof, human grade), IMO-Bench supports research in autoformalization, proof synthesis, toolchain building, and explainable mathematical AI.
7. Implications and Future Directions
The publication of IMO-Bench marks a significant redirection of AI mathematical reasoning research:
- The suite is positioned as a new standard for benchmarking, exceeding GSM8K, MATH, and similar datasets in both breadth and depth.
- Automated grading tools, with their close human alignment, accelerate model evaluation cycles and enable reproducible experiments at scale.
- Mixed results on the proof and grading tracks signal focus areas for architectural and training advances, especially reliable handling of advanced combinatorial and geometric arguments, formal proof extraction, and minimizing rigor gaps in free-form proofs.
- Ongoing robustification and potential adversarial problem generation are necessary to sustain the relevance and contamination-resistance of the benchmark as LLM capabilities advance.
IMO-Bench is freely available to the research community at https://imobench.github.io/, forming the basis for trustworthy and forward-looking mathematical reasoning research (Luong et al., 3 Nov 2025).