AI Mathematical Olympiad (AIMO)
- AI Mathematical Olympiad (AIMO) is a research initiative and competition framework that benchmarks AI systems on high-level mathematical reasoning using formalized proofs and diverse datasets.
- It integrates state-of-the-art datasets, formal methods, and neuro-symbolic techniques to drive advancements in solving International Mathematical Olympiad problems.
- The initiative fosters innovations in automated reasoning, supports reproducible evaluation, and paves the way for enhanced tools in mathematical education and formal verification.
The AI Mathematical Olympiad (AIMO) is an ambitious research initiative and competition framework aimed at benchmarking, advancing, and ultimately producing artificial intelligence systems capable of solving International Mathematical Olympiad (IMO) problems at or above the level of human gold medalists. AIMO encompasses a suite of challenges, benchmarking protocols, datasets, formalization and solution-generation tasks, and training methodologies targeted at automating the high-level mathematical reasoning required for Olympiad mathematics. It synthesizes progress from formal methods, symbolic reasoning, deep learning, dataset creation, solution evaluation, and competition-based model assessment.
1. Motivation and Historical Context
AIMO was conceived as a grand challenge in the AI community, with the explicit goal of building an AI system that can win a gold medal at the International Mathematical Olympiad. This paradigm represents not just problem-solving automation but a comprehensive test of formal reasoning, creativity, and generalization on mathematically rigorous and human-originating benchmarks (Marić et al., 2020). Its emergence is rooted in distinct research tracks:
- The development of interactive theorem provers capable of encoding and machine-checking Olympiad problems.
- The publication and public release of formalized Olympiad-level datasets and benchmarks such as miniF2F (Zheng et al., 2021), FIMO (Liu et al., 2023), OlympiadBench (He et al., 21 Feb 2024), and others.
- The practical need for standardized, reproducible, and challenging benchmarks to guide both neural and symbolic mathematical reasoning research.
The AIMO framework not only focuses on problem-solving but also supports the creation of a public ecosystem for formal mathematics, student education, automated evaluation, and collaborative advancement.
2. Datasets and Benchmarking Foundations
Multiple datasets underpin AIMO efforts, each targeting different aspects of Olympiad-level mathematics:
| Dataset | Description | Formalisms / Formats | Source Problems | Notable Features |
|---|---|---|---|---|
| miniF2F | 488 problems, cross-system, stratified by topic | Lean, Metamath, Isabelle, HOL Light | IMO, AMC, AIME, MATH | Multi-system, supports tactic-based proofs |
| FIMO | 149 IMO Shortlist problems | Lean, with LaTeX informal proofs | IMO Shortlist (algebra, number theory) | Iterative auto-formalization with GPT-4 |
| OlympiadBench | 8,952 math and physics Olympiad problems | Text, image, LaTeX | IMO, IPhO, Gaokao, Chinese Olympiads | Bilingual, multimodal, stepwise annotations |
| formalgeo7k / IMO | 6,981 geometry problems; 2,627 IMO-level geometry problems | Custom formal languages | Geometry Olympiads | 88 predicates, 196 theorems, diagrams |
| OpenMathReasoning | 540K problems; 3.2M CoT and 1.7M TIR solutions | Natural language, Python, code-integrated | AoPS, Olympiad, competition math | Tool-integrated reasoning, solution selection |
| MathOdyssey | 387 problems (Olympiad, high school, university) | Natural language, LaTeX, solution steps | Expert-created | Chain-of-thought, final answer + steps |
These datasets provide diverse benchmarks capturing algebra, number theory, inequalities, geometry, combinatorics, proof-based questions, and multimodal reasoning (Zheng et al., 2021, Liu et al., 2023, He et al., 21 Feb 2024, Zhang et al., 2023, Moshkov et al., 23 Apr 2025, Fang et al., 26 Jun 2024).
3. Solution Formalization and Proof Infrastructure
A defining aspect of AIMO is the requirement for precise, formal, and verifiable solutions. Systems and protocols have been established for formalizing Olympiad problems:
- Isabelle/HOL and Lean serve as primary proof assistants, enabling structured, machine-checkable script construction (Marić et al., 2020, Yousefzadeh et al., 28 Nov 2024). Solutions are formalized using definitions, lemmas, and structured induction, e.g., as Isar or Lean tactic scripts, with type-level rigor narrowing the gap between informal and formal arguments. For example:
```isabelle
theorem IMO_2006_SL_A2:
  fixes a :: "nat ⇒ real"
  assumes "a 0 = -1"
      and "⋀ n. n ≥ 1 ⟹ (∑ k≤n. a (n-k) / (k+1)) = 0"
  assumes "n ≥ 1"
  shows "a n > 0"
```
- Decomposition methodology: Recent work decomposes complex IMO proofs into hundreds of intermediate lemmas (e.g., 1,329 lemmas across roughly 40,000 lines of Lean code in (Yousefzadeh et al., 28 Nov 2024)), providing granular failure points for AI models and facilitating diagnostic evaluation; a toy illustration of this decomposition style appears at the end of this section.
- Geometry-specific frameworks (e.g., FormalGeo, FGPS, TongGeometry) introduce problem-specific languages and theorem databases (e.g., 88 predicates, 196 theorems (Zhang et al., 2023); billions of new theorems (Zhang et al., 14 Dec 2024)) optimized for geometric reasoning and including forward and backward search capabilities.
- Symbolic Computation: Algebraic and functional-synthesis problems are addressed with template-and-quantifier-elimination (template-and-QE) pipelines backed by SMT solvers for completeness, essential for tasks of the “find all functions” kind (Brown et al., 18 Apr 2024); a minimal sketch of the template step follows below.
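As a minimal illustration of the template step only (coefficient matching on an invented Cauchy-style equation; the cited pipelines additionally invoke SMT-based quantifier elimination to certify completeness), one can posit a linear ansatz and solve for the admissible coefficients:

```python
# Toy sketch of the "template" step for a find-all-functions task.
# Posit a linear ansatz f(x) = a*x + b and ask which coefficients make
# f(x + y) = f(x) + f(y) hold identically in x and y.
from sympy import Poly, expand, solve, symbols

a, b, x, y = symbols("a b x y")
f = lambda t: a * t + b                       # candidate template

# The residual must vanish for all x, y, so every coefficient must be zero.
residual = expand(f(x + y) - (f(x) + f(y)))
constraints = Poly(residual, x, y).coeffs()

print(solve(constraints, [a, b], dict=True))  # [{b: 0}] -> f(x) = a*x
```

In the full template-and-QE approach, a separate quantifier-elimination pass is then used to argue that no solutions exist outside the chosen template family, which coefficient matching alone cannot establish.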
These frameworks allow step-by-step verification, facilitate automation, and serve as a backbone for training and benchmarking models on rigorous solution writing.
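To illustrate the lemma-level decomposition described above, the following Lean 4 / Mathlib sketch (statements and names invented for illustration, not drawn from the cited formalizations) shows how a top-level theorem is assembled from independently provable lemmas, so that an automated prover's failure is localized to a single intermediate step:

```lean
import Mathlib

-- Toy decomposition: each intermediate lemma is an isolated target
-- that an automated prover can attempt (and fail) independently.
theorem helper_nonneg (x : ℝ) (hx : 1 ≤ x) : 0 ≤ x := by
  linarith

theorem key_inequality (x : ℝ) (hx : 0 ≤ x) : x ^ 3 + 2 ≥ 3 * x := by
  nlinarith [sq_nonneg (x - 1), mul_nonneg hx (sq_nonneg (x - 1))]

-- The top-level statement only chains the lemmas, so a failed attempt
-- points at a specific intermediate step rather than the whole argument.
theorem main_statement (x : ℝ) (hx : 1 ≤ x) : x ^ 3 + 2 ≥ 3 * x :=
  key_inequality x (helper_nonneg x hx)
```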
4. Automated Reasoning Systems and Methodologies
AIMO research encompasses a spectrum of symbolic, neuro-symbolic, and language-model-based methods:
- Symbolic and Deductive Engines: Wu’s method for geometric theorem proving translates hypotheses and conclusions into polynomial systems, using variable elimination and non-degeneracy conditions (Sinha et al., 9 Apr 2024); a polynomial-translation sketch follows after this list. Synthetic and deductive-database (DD) methods (e.g., angle chasing) are integrated for broader coverage.
- Neuro-symbolic Approaches: AlphaGeometry and AlphaGeometry2 combine large-scale synthetic data generation (100–300M+ synthetic proofs), custom domain-specific language extensions, and neural-network-guided construction. AG2 added support for non-constructive problems, movement of objects, and linear equations over angles and ratios, solving 84% of IMO geometry problems from 2000–2024 and surpassing average gold-medalist performance (Chervonyi et al., 5 Feb 2025).
- Learning Approaches: Mixed-reasoning systems like AIPS utilize curriculum learning-guided value networks to heuristically drive proof search, achieving state-of-the-art results on Olympiad inequalities (Wei et al., 20 Jun 2024).
- Monte Carlo Tree Self-Refine (MCTSr): An integration of LLMs with MCTS, where nodes in the tree represent candidate solution paths refined via heuristic self-critique and updated with quality scores, leading to marked success rate improvements on Olympiad-level benchmarks (Zhang et al., 11 Jun 2024).
- Step-By-Step Coding (SBSC): A multi-turn code-generation and execution approach, decomposing problems into sub-tasks, generating programs for each, and integrating outputs for subsequent subproblems—yielding 6–12% absolute accuracy improvements over state-of-the-art on AMC12, AIME, and MathOdyssey (Singh et al., 23 Feb 2025).
- Tool-Integrated Reasoning (TIR): Combining LLMs with selective code execution, as implemented in OpenMathReasoning, is a central feature in recent AIMO prize-winning systems, enabling models to handle multi-hop calculations and brute-force search when pure text-based reasoning proves inadequate (Moshkov et al., 23 Apr 2025).
- Advanced Training and Selection: Generative solution selection (GenSelect) pipelines compare candidate outputs on reasoning summaries, outperforming standard majority voting. Combined with tool integration and advanced filtering, these approaches are critical to state-of-the-art models’ competitive edge.
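To make the polynomial translation behind Wu-style geometry proving concrete (referenced in the first bullet above), the following sketch uses an invented toy configuration and an ideal-membership check via Gröbner bases; Wu’s method proper uses triangular forms, successive pseudo-division, and explicit non-degeneracy conditions, but the translation from geometry to polynomial algebra is the same:

```python
# Toy algebraic translation: right triangle with C = (0, 0), A = (a, 0),
# B = (0, b); M = (x, y) is the midpoint of AB. Claim: |MA| = |MC|.
# Hypotheses and conclusion become polynomials, and the conclusion must
# lie in the ideal generated by the hypotheses.
from sympy import groebner, symbols

a, b, x, y = symbols("a b x y")

hypotheses = [2 * x - a, 2 * y - b]                        # M is the midpoint of AB
conclusion = ((x - a) ** 2 + y ** 2) - (x ** 2 + y ** 2)   # |MA|^2 - |MC|^2

G = groebner(hypotheses, x, y, a, b, order="lex")
print(G.contains(conclusion))   # True -> the claim follows identically
```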
5. Evaluation, Metrics, and AIMO Competition Design
Evaluation of AIMO systems leverages a suite of strict and multifaceted metrics:
- Proof Validity: For formal systems (Isabelle, Lean), pass rates on problem sets (e.g., Pass@1, Pass@8) are measured via machine-checking; the standard estimator is sketched after this list. In Lean, state-of-the-art models achieve pass rates below 25% on Olympiad-level problems, and formal proof synthesis remains a substantial bottleneck (Zheng et al., 2021, Yousefzadeh et al., 28 Nov 2024).
- Functional Correctness: For tool-integrated and chain-of-thought systems, only solutions that yield the correct final numerical or symbolic answer are counted. Automatic grading combines symbolic computation tools (SymPy, Python), answer normalization, and, for open-ended or proof-based questions, manual verification; a minimal equivalence check is sketched at the end of this section.
- Token Efficiency: Recent research introduces mean token count as a metric for reasoning efficiency, optimized via reinforcement learning strategies such as GRPO (Yoshihara et al., 11 Jul 2025).
- Solution Selection: The use of generative selection models instead of majority voting yields further improvements in accuracy, especially in multi-candidate settings (Moshkov et al., 23 Apr 2025).
- Benchmark Competitions: The AIMO Progress Prize challenges are structured as strictly “leak-free” multi-round competitions. Models are evaluated on unseen, high-difficulty problems and required to output solutions in prescribed formats (e.g., boxed final answers, LaTeX formatting).
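For reference, the Pass@k figures above are typically reported with the standard unbiased estimator over n sampled attempts, of which c verify; the snippet below is a generic sketch rather than the exact harness used in any cited evaluation:

```python
# Unbiased pass@k estimator: probability that at least one of k attempts,
# drawn without replacement from n samples with c correct ones, succeeds.
# pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:                # every size-k draw contains a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 proof attempts per problem, 2 of which machine-check.
print(pass_at_k(n=8, c=2, k=1))   # 0.25
print(pass_at_k(n=8, c=2, k=8))   # 1.0
```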
A key outcome is that top models can now solve >30/50 problems on private Olympiad-style test sets, with pass@1 rates and efficiency metrics tracked alongside solution quality.
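A minimal sketch of the symbolic answer-equivalence check used in functional-correctness grading (the function and its normalization behavior are illustrative; production graders add LaTeX parsing, format normalization, and numeric fallbacks):

```python
# Check whether a predicted final answer matches the reference answer by
# parsing both with SymPy and testing whether their difference simplifies
# to zero. Unparseable outputs are treated as incorrect.
from sympy import simplify, sympify

def answers_match(predicted: str, reference: str) -> bool:
    try:
        difference = sympify(predicted) - sympify(reference)
        return simplify(difference) == 0
    except (SyntaxError, TypeError, ValueError):
        return False

print(answers_match("sqrt(8)", "2*sqrt(2)"))   # True
print(answers_match("100/3", "33"))            # False
```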
6. Challenges, Limitations, and Future Directions
Despite notable successes, significant challenges remain:
- Formalization Gaps: Automatic translation of informal proofs into fully formal ones remains limited, both by the scarcity of formal training data and by the difficulty of producing human-interpretable formal arguments; even advanced LLMs struggle to generate complete Lean proofs for hard IMO problems (Yousefzadeh et al., 28 Nov 2024).
- Generalization and Robustness: Models that perform well on curated or high-school–level datasets often falter on true Olympiad-level tasks, particularly those that require deep, multi-step chain-of-thought reasoning and symbolic manipulation (He et al., 21 Feb 2024, Fang et al., 26 Jun 2024).
- Human-AI Performance Disparities: Studies in mathematical reasoning across age and modality (e.g., SMART-840 for children’s Olympiads) indicate that AI models’ capabilities often diverge from human cognitive progression and foundational skills, especially in geometry and logic (Cherian et al., 22 Jun 2024).
- Language and Modality Adaptation: Recent work demonstrates progress on multilingual, cross-cultural datasets (e.g., Bangla Olympiad problems (Tabib et al., 8 Jan 2025)) and on multimodal (image-text) reasoning, but performance lags behind in many such domains.
- Data Scaling and Synthesis: New problem synthesis pipelines (PromptCoT) expand the Olympiad-level training corpus by orders of magnitude, but establishing meaningful evaluation and preventing data leakage in high-stakes competitions remains crucial (Zhao et al., 4 Mar 2025).
- Reproducibility and Open Science: Recent winning recipes emphasize open-sourcing code, models, and all training artifacts to enable rigorous verification and community contribution (Yoshihara et al., 11 Jul 2025).
Future research directions stressed in the literature include automatic problem generation and solution pairing, the integration of retrieval and code execution in multilingual contexts, formal decomposition methodologies for finer-grained diagnosis, and models that can both generate and select optimal solutions under strict competition conditions. There is also forward-looking interest in hybrid systems that can both propose problems and act as geometry coaches, as exemplified by TongGeometry (Zhang et al., 14 Dec 2024).
7. Impact and Broader Significance
AIMO research constitutes a watershed for AI-driven mathematical creativity, with implications far beyond competition. Advances in AIMO systems:
- Enable new strategies for formal verification, program synthesis, and symbolic computation.
- Serve as testbeds for AI systems with explainability, modularity, and interactive reasoning—key requirements in fields like software engineering, education, and scientific discovery.
- Underpin practical tools for mathematical education, offering training, automated feedback, and even personalized Olympiad coaching or problem proposal (Zhang et al., 14 Dec 2024).
The progression from dataset curation, formalization, and verification to integrated, competitive, and open-source AI systems positions AIMO as a central driver in the pursuit of genuinely general and creative artificial intelligence in mathematical domains.