IMO-AnswerBench: Olympiad Math Evaluation
- IMO-AnswerBench is a specialized benchmarking suite that evaluates IMO-level mathematical reasoning through 400 handpicked problems across algebra, combinatorics, geometry, and number theory.
- The benchmark incorporates robustification techniques—such as entity renaming and numerical adjustment—to prevent model memorization and enforce unique, nontrivial solutions.
- It leverages a fully automated grading system with near-perfect human grading correlation, revealing significant performance gaps among cutting-edge language models.
IMO-AnswerBench is a specialized benchmarking suite for evaluating mathematical reasoning and problem-solving ability at the International Mathematical Olympiad (IMO) level. It is engineered to overcome limitations of existing mathematical AI evaluation frameworks, especially in model discrimination, robustness to memorization, and automated grading rigor, by focusing on challenging contest problems with unique, verifiable answers and strict autograding protocols.
1. Benchmark Design, Motivation, and Problem Selection
IMO-AnswerBench was developed to address critical weaknesses in legacy datasets such as GSM8K, MATH, and AIME, which have become saturated and no longer differentiate top-performing LLMs. Unlike these "final answer" benchmarks, IMO-AnswerBench features 400 handpicked Olympiad problems, drawn from national, regional, and top international competitions, distributed equally across four mathematical domains: algebra, combinatorics, geometry, and number theory (Luong et al., 3 Nov 2025). Each domain contains 100 problems spanning “Pre-IMO,” “IMO-Easy,” “IMO-Medium,” and “IMO-Hard” difficulty levels, as shown in the table below. Problems are selected and robustified by leading specialists to enforce diversity and prevent solution recall via training-set memorization.
| Category | Pre-IMO | IMO-Easy | IMO-Medium | IMO-Hard |
|---|---|---|---|---|
| Algebra | 11 | 46 | 32 | 11 |
| Combinatorics | 4 | 19 | 31 | 46 |
| Geometry | 13 | 44 | 32 | 11 |
| Number Theory | 2 | 20 | 31 | 47 |
Design features include comprehensive topic representation, robustification through paraphrasing and distractor augmentation, and problem reformulation to ensure a unique, nontrivial answer, addressing prior grading noise and the vulnerability of models to template-based guessing.
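To make this structure concrete, the sketch below shows a hypothetical record type for a benchmark item and a consistency check against the distribution table above. The field names and interface are illustrative assumptions, not the released data format.

```python
# Hypothetical sketch of an IMO-AnswerBench item and a consistency check
# against the category/difficulty table above. Field names are assumptions,
# not the released data schema.
from collections import Counter
from dataclasses import dataclass

CATEGORIES = {"algebra", "combinatorics", "geometry", "number_theory"}
DIFFICULTIES = ("Pre-IMO", "IMO-Easy", "IMO-Medium", "IMO-Hard")

@dataclass
class BenchmarkItem:
    category: str     # one of CATEGORIES
    difficulty: str   # one of DIFFICULTIES
    statement: str    # robustified problem text
    gold_answer: str  # unique short answer consumed by the autograder

def check_distribution(items: list[BenchmarkItem]) -> dict[str, Counter]:
    """Tally problems per (category, difficulty) cell and verify the totals."""
    assert len(items) == 400, "the benchmark comprises 400 problems"
    dist: dict[str, Counter] = {c: Counter() for c in CATEGORIES}
    for item in items:
        dist[item.category][item.difficulty] += 1
    for category, counts in dist.items():
        assert sum(counts.values()) == 100, f"{category} should contain 100 problems"
    return dist
```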
2. Robustification and Uniqueness
To prevent models from leveraging superficial memorization, each problem in IMO-AnswerBench undergoes robustification: entity renaming, numerical adjustment, restructuring, and insertion of contextual distractors (Luong et al., 3 Nov 2025). Manual and automatic rewriting ensures that problems deviate from canonical forms found in existing solution datasets. Unique-answer formatting is enforced via problem restatement, often translating general solution requests into questions demanding explicit computed quantities (e.g., “sum of all values” instead of “find all values”).
| Example Original | Example Robustified |
|---|---|
| Find all … such that … | Find the sum of all … for which … |
The performance drop from original to robustified forms empirically validates the increased problem difficulty and the benchmark's resilience to memorization strategies.
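A minimal sketch of the robustification idea follows. The helper functions and the specific string rewrites are illustrative assumptions rather than the authors' tooling; in the benchmark itself, each rewrite is vetted by specialists so the reformulated problem keeps a unique, well-defined answer.

```python
# Illustrative sketch of robustification: entity renaming, numerical
# adjustment, and unique-answer reformulation. These naive string rewrites
# are assumptions for exposition; real robustification is expert-reviewed.
import random
import re

ENTITY_POOL = ["Ana", "Boris", "Chen", "Divya", "Emil"]

def rename_entities(statement: str, old_names: list[str]) -> str:
    """Swap the named actors in a problem for fresh names."""
    new_names = random.sample(ENTITY_POOL, len(old_names))
    for old, new in zip(old_names, new_names):
        statement = re.sub(rf"\b{re.escape(old)}\b", new, statement)
    return statement

def perturb_constants(statement: str, delta: int = 1) -> str:
    """Shift explicit integer constants so memorized answers no longer apply."""
    return re.sub(r"\b\d+\b", lambda m: str(int(m.group(0)) + delta), statement)

def to_unique_answer(statement: str) -> str:
    """Turn a 'find all ...' request into a single computed quantity."""
    return statement.replace("Find all", "Find the sum of all")

def robustify(statement: str, old_names: list[str]) -> str:
    return to_unique_answer(perturb_constants(rename_entities(statement, old_names)))
```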
3. Evaluation and Automated Grading Methodology
All IMO-AnswerBench problems are short-answer format, typically with a unique combinatorial, algebraic, or numeric solution. Grading is fully automated via the AnswerAutoGrader framework, which extracts the model's final answer and assesses it for semantic equivalence with the gold solution (Luong et al., 3 Nov 2025). The autograder, powered by LLM judgment, is designed to recognize equivalence beyond simple string matching (e.g., set-theoretic and numeric equality) and awards no partial credit for incomplete or partially correct answers.
Autograder reliability is validated by near-perfect agreement with human grading (98.9%). Example prompt fragment:
"Your sole function is to determine if the final answer provided in the Model Solution is mathematically equivalent to the Golden Answer. Equivalence is mandatory for a correct grade. Rigorous reasoning about numerical and algebraic equivalence ... No partial credit."
Strict output formatting and answer extraction ensure scalability and objectivity; grading noise from symbolic, interval, or ambiguous answer types is eliminated.
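A hedged sketch of such a grading flow is shown below. It is not the released AnswerAutoGrader: `call_judge` is a placeholder for whatever LLM endpoint renders the equivalence verdict, and the sympy pre-check is an added convenience for plainly numeric or algebraic answers.

```python
# Sketch of an automated equivalence grader in the spirit of AnswerAutoGrader.
# `call_judge` is a placeholder (prompt -> text) for an LLM judge; only the
# sympy pre-check runs locally. No partial credit is awarded.
from typing import Callable
import sympy

JUDGE_PROMPT = (
    "Your sole function is to determine if the final answer provided in the "
    "Model Solution is mathematically equivalent to the Golden Answer. "
    "Reply with exactly EQUIVALENT or NOT_EQUIVALENT.\n"
    "Golden Answer: {gold}\nModel Solution: {model}"
)

def symbolic_equal(model_answer: str, gold_answer: str) -> bool:
    """Cheap pre-check: parse both answers and compare them symbolically."""
    try:
        diff = sympy.simplify(sympy.sympify(model_answer) - sympy.sympify(gold_answer))
        return diff == 0
    except (sympy.SympifyError, TypeError):
        return False  # sets, tuples, or verbal answers fall through to the judge

def grade(model_answer: str, gold_answer: str,
          call_judge: Callable[[str], str]) -> bool:
    """Return True iff the extracted final answer matches the gold answer."""
    if symbolic_equal(model_answer, gold_answer):
        return True
    verdict = call_judge(JUDGE_PROMPT.format(gold=gold_answer, model=model_answer))
    return verdict.strip().upper().startswith("EQUIVALENT")
```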
4. Quantitative Results and Model Performance
Benchmarking reveals substantial gaps between cutting-edge models and IMO-level performance. On IMO-AnswerBench, the IMO-gold variant of Gemini Deep Think (see table below) achieves 80.0% accuracy, surpassing the strongest non-Gemini model (Grok 4) by 6.9 percentage points and the leading open-weight models by 19.2 points. Scores are consistently lowest in combinatorics, indicating persistent weaknesses in this reasoning domain.
| Model | Algebra | Combinatorics | Geometry | Number Theory | Overall |
|---|---|---|---|---|---|
| Gemini 2.5 Deep Think | 78.0% | 49.0% | 83.0% | 77.0% | 71.8% |
| IMO-gold (Gemini) | 85.0% | 69.0% | 88.0% | 78.0% | 80.0% |
| Grok 4 | 75.5% | 55.9% | 80.1% | 80.9% | 73.1% |
| GPT-5 | 69.9% | 46.4% | 74.8% | 71.2% | 65.6% |
Robustification reliably reduces scores across all models, confirming the benchmark’s resistance to pattern-based solution extraction.
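Because every domain contributes exactly 100 problems, each model's overall score is the unweighted mean of its four domain accuracies; the short check below reproduces the overall column of the table.

```python
# Overall accuracy = mean of domain accuracies (each domain has 100 problems).
domain_scores = {
    "Gemini 2.5 Deep Think": [78.0, 49.0, 83.0, 77.0],
    "IMO-gold (Gemini)":     [85.0, 69.0, 88.0, 78.0],
    "Grok 4":                [75.5, 55.9, 80.1, 80.9],
    "GPT-5":                 [69.9, 46.4, 74.8, 71.2],
}
for model, scores in domain_scores.items():
    print(f"{model}: {sum(scores) / len(scores):.1f}%")  # 71.8, 80.0, 73.1, 65.6
```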
5. Comparison to Related IMO and Advanced Mathematical Benchmarks
IMO-AnswerBench’s focus on unique short answers and auditability contrasts with earlier IMO-centric benchmarks such as OlymMATH, Omni-MATH, and AIMO, which suffer from answer format heterogeneity, grading noise, and non-deterministic correctness assessment (Chen et al., 9 Sep 2025). RIMO-N employs deterministic grading by enforcing single-integer answers on IMO problems, while RIMO-P decomposes proofs into scored substeps. Both highlight pronounced drops (to 33–63%) in SOTA LLM performance compared to classical benchmarks (GSM8K/MATH, >90%), with negligible correlation between numeric and proof-writing accuracy.
The formal reasoning benchmarks CombiBench (Liu et al., 6 May 2025) and the Lean “IMO-Steps” dataset (Yousefzadeh et al., 28 Nov 2024) reveal that even elite LLMs fine-tuned for theorem proving struggle on combinatorics and advanced formalization, with pass@16 scores below 7% for the hardest Lean problems.
6. Impact, Limitations, and Future Directions
IMO-AnswerBench establishes a robust and reproducible North Star for the community, enabling large-scale, objective discrimination between frontier mathematical models. The advances in robustification, answer standardization, and high-correlation autograding set a new bar for future datasets targeting agentic, generalizing mathematical intelligence (Luong et al., 3 Nov 2025). These attributes enable rapid, reproducible evaluation that bypasses human-grading bottlenecks while exposing persistent challenges in agentic search, long-horizon reasoning, and combinatorial creativity.
Major advances are still needed in mathematical generalization (as evidenced by the uniformly low combinatorics scores), cross-domain transfer, and reliable proof-writing. IMO-AnswerBench informs ongoing development in the automated evaluation of reasoning agents, AI peer grading, and the next generation of self-verifying and proof-generating models. The continued release of problem data, grading tools, and gold solutions (see https://imobench.github.io) accelerates progress towards closing the observed reasoning gap.
7. Sample Problem and Illustrative Reasoning
A representative robustified algebra problem takes the following form: given several positive real numbers satisfying a stated constraint, find the maximum value of a specified expression.
This problem exemplifies the suite’s emphasis on unique answer generation, transformation of standard contest queries, and challenge to both symbolic and numeric reasoning. Automated grading protocols demand full correctness for equivalence, not mere syntactic matches.
IMO-AnswerBench marks a significant transition from answer-matching benchmarks to robust, reasoning-centered evaluation for mathematical AI, with stringent requirements on diversity, answer uniqueness, grading quality, and resistance to memorization. Its deployment, results, and methodological innovations form the foundation for objective progress tracking and targeted research on advanced mathematical reasoning systems at the Olympiad level.