AI Mathematical Olympiad (AIMO)
- AI Mathematical Olympiad (AIMO) is a research initiative and competition framework that benchmarks AI systems on high-level mathematical reasoning using formalized proofs and diverse datasets.
- It integrates state-of-the-art datasets, formal methods, and neuro-symbolic techniques to drive advancements in solving International Mathematical Olympiad problems.
- The initiative fosters innovations in automated reasoning, supports reproducible evaluation, and paves the way for enhanced tools in mathematical education and formal verification.
The AI Mathematical Olympiad (AIMO) is an ambitious research initiative and competition framework aimed at benchmarking, advancing, and ultimately producing artificial intelligence systems capable of solving International Mathematical Olympiad (IMO) problems at or above the level of human gold medalists. AIMO encompasses a suite of challenges, benchmarking protocols, datasets, formalization and solution-generation tasks, and training methodologies targeted at automating the high-level mathematical reasoning required for Olympiad mathematics. It synthesizes progress from formal methods, symbolic reasoning, deep learning, dataset creation, solution evaluation, and competition-based model assessment.
1. Motivation and Historical Context
AIMO was conceived as a grand challenge in the AI community, with the explicit goal of building an AI system that can win a gold medal at the International Mathematical Olympiad. This paradigm represents not just problem-solving automation but a comprehensive test of formal reasoning, creativity, and generalization on mathematically rigorous and human-originating benchmarks (Marić et al., 2020). Its emergence is rooted in distinct research tracks:
- The development of interactive theorem provers capable of encoding and machine-checking Olympiad problems.
- The publication and public release of formalized Olympiad-level datasets and benchmarks such as miniF2F (Zheng et al., 2021), FIMO (Liu et al., 2023), OlympiadBench (He et al., 21 Feb 2024), and others.
- The practical need for standardized, reproducible, and challenging benchmarks to guide both neural and symbolic mathematical reasoning research.
The AIMO framework not only focuses on problem-solving but also supports the creation of a public ecosystem for formal mathematics, student education, automated evaluation, and collaborative advancement.
2. Datasets and Benchmarking Foundations
Multiple datasets underpin AIMO efforts, each targeting different aspects of Olympiad-level mathematics:
| Dataset | Description | Formalisms / Formats | Source Problems | Notable Features |
|---|---|---|---|---|
| miniF2F | 488 problems, cross-system, stratified by topic | Lean, Metamath, Isabelle, HOL Light | IMO, AMC, AIME, MATH | Multi-system, supports tactic-based proofs |
| FIMO | 149 IMO Shortlist problems | Lean, with LaTeX informal proofs | IMO Shortlist (algebra, number theory) | Iterative auto-formalization with GPT-4 |
| OlympiadBench | 8,952 math and physics Olympiad problems | Text, image, LaTeX | IMO, IPhO, Gaokao, Chinese Olympiads | Bilingual, multimodal, stepwise annotations |
| formalgeo7k / IMO | 6,981 geometry problems; 2,627 IMO-level geometry problems | Custom formal languages | Geometry Olympiads | 88 predicates, 196 theorems, diagrams |
| OpenMathReasoning | 540K problems; 3.2M CoT and 1.7M TIR solutions | Natural language, Python, code-integrated | AoPS, Olympiad, competition math | Tool-integrated reasoning, solution selection |
| MathOdyssey | 387 problems (Olympiad, high school, university) | Natural language, LaTeX, solution steps | Expert-created | Chain-of-thought, final answer + steps |
These datasets provide diverse benchmarks capturing algebra, number theory, inequalities, geometry, combinatorics, proof-based questions, and multimodal reasoning (Zheng et al., 2021, Liu et al., 2023, He et al., 21 Feb 2024, Zhang et al., 2023, Moshkov et al., 23 Apr 2025, Fang et al., 26 Jun 2024).
3. Solution Formalization and Proof Infrastructure
A defining aspect of AIMO is the requirement for precise, formal, and verifiable solutions. Systems and protocols have been established for formalizing Olympiad problems:
- Isabelle/HOL and Lean serve as primary proof assistants, enabling structured, machine-checkable script construction (Marić et al., 2020, Yousefzadeh et al., 28 Nov 2024). Solutions are formalized using definitions, lemmas, and structured induction, e.g., as Isar or Lean tactic scripts, with type-level rigor narrowing the gap between informal and formal arguments. For example:
```isabelle
theorem IMO_2006_SL_A2:
  fixes a :: "nat ⇒ real"
  assumes "a 0 = -1"
      and "⋀ n. n ≥ 1 ⟹ (∑ k≤n. a (n-k) / (k+1)) = 0"
  assumes "n ≥ 1"
  shows "a n > 0"
```
- Decomposition methodology: Recent work decomposes complex IMO proofs into hundreds of intermediate lemmas (e.g., 1,329 lemmas across roughly 40,000 lines of Lean code in (Yousefzadeh et al., 28 Nov 2024)), providing granular failure points for AI models and facilitating diagnostic evaluation; a toy illustration of this decomposition style appears at the end of this section.
- Geometry-specific frameworks (e.g., FormalGeo, FGPS, TongGeometry) introduce problem-specific languages and theorem databases (e.g., 88 predicates, 196 theorems (Zhang et al., 2023); billions of new theorems (Zhang et al., 14 Dec 2024)) optimized for geometric reasoning and including forward and backward search capabilities.
- Symbolic Computation: Algebraic and functional-synthesis problems are addressed with template-and-quantifier-elimination (template-and-QE) pipelines backed by SMT solvers for completeness, essential for tasks of the “find all functions” kind (Brown et al., 18 Apr 2024); a minimal sketch of the template step follows below.
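As a minimal illustration of the template step only (coefficient matching on an invented Cauchy-style equation; the cited pipelines additionally invoke SMT-based quantifier elimination to certify completeness), one can posit a linear ansatz and solve for the admissible coefficients:

```python
# Toy sketch of the "template" step for a find-all-functions task.
# Posit a linear ansatz f(x) = a*x + b and ask which coefficients make
# f(x + y) = f(x) + f(y) hold identically in x and y.
from sympy import Poly, expand, solve, symbols

a, b, x, y = symbols("a b x y")
f = lambda t: a * t + b                       # candidate template

# The residual must vanish for all x, y, so every coefficient must be zero.
residual = expand(f(x + y) - (f(x) + f(y)))
constraints = Poly(residual, x, y).coeffs()

print(solve(constraints, [a, b], dict=True))  # [{b: 0}] -> f(x) = a*x
```

In the full template-and-QE approach, a separate quantifier-elimination pass is then used to argue that no solutions exist outside the chosen template family, which coefficient matching alone cannot establish.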
These frameworks allow step-by-step verification, facilitate automation, and serve as a backbone for training and benchmarking models on rigorous solution writing.
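To illustrate the lemma-level decomposition described above, the following Lean 4 / Mathlib sketch (statements and names invented for illustration, not drawn from the cited formalizations) shows how a top-level theorem is assembled from independently provable lemmas, so that an automated prover's failure is localized to a single intermediate step:

```lean
import Mathlib

-- Toy decomposition: each intermediate lemma is an isolated target
-- that an automated prover can attempt (and fail) independently.
theorem helper_nonneg (x : ℝ) (hx : 1 ≤ x) : 0 ≤ x := by
  linarith

theorem key_inequality (x : ℝ) (hx : 0 ≤ x) : x ^ 3 + 2 ≥ 3 * x := by
  nlinarith [sq_nonneg (x - 1), mul_nonneg hx (sq_nonneg (x - 1))]

-- The top-level statement only chains the lemmas, so a failed attempt
-- points at a specific intermediate step rather than the whole argument.
theorem main_statement (x : ℝ) (hx : 1 ≤ x) : x ^ 3 + 2 ≥ 3 * x :=
  key_inequality x (helper_nonneg x hx)
```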
4. Automated Reasoning Systems and Methodologies
AIMO research encompasses a spectrum of symbolic, neuro-symbolic, and language-model-based methods:
- Symbolic and Deductive Engines: Wu’s method for geometric theorem proving translates hypotheses and conclusions into polynomial systems, using variable elimination and non-degeneracy conditions (Sinha et al., 9 Apr 2024); a polynomial-translation sketch follows after this list. Synthetic and deductive-database (DD) methods (e.g., angle chasing) are integrated for broader coverage.
- Neuro-symbolic Approaches: AlphaGeometry and AlphaGeometry2 combine large-scale synthetic data generation (100–300M+ synthetic proofs), custom domain-specific language extensions, and neural-network-guided construction. AG2 added support for non-constructive problems, movement of objects, and linear equations over angles and ratios, solving 84% of IMO geometry problems from 2000–2024 and surpassing average gold-medalist performance (Chervonyi et al., 5 Feb 2025).
- Learning Approaches: Mixed-reasoning systems like AIPS utilize curriculum learning-guided value networks to heuristically drive proof search, achieving state-of-the-art results on Olympiad inequalities (Wei et al., 20 Jun 2024).
- Monte Carlo Tree Self-Refine (MCTSr): An integration of LLMs with MCTS, where nodes in the tree represent candidate solution paths refined via heuristic self-critique and updated with quality scores, leading to marked success rate improvements on Olympiad-level benchmarks (Zhang et al., 11 Jun 2024).
- Step-By-Step Coding (SBSC): A multi-turn code-generation and execution approach, decomposing problems into sub-tasks, generating programs for each, and integrating outputs for subsequent subproblems—yielding 6–12% absolute accuracy improvements over state-of-the-art on AMC12, AIME, and MathOdyssey (Singh et al., 23 Feb 2025).
- Tool-Integrated Reasoning (TIR): Combining LLMs with selective code execution, as implemented in OpenMathReasoning, is a central feature in recent AIMO prize-winning systems, enabling models to handle multi-hop calculations and brute-force search when pure text-based reasoning proves inadequate (Moshkov et al., 23 Apr 2025).
- Advanced Training and Selection: Generative solution selection (GenSelect) pipelines compare candidate outputs on reasoning summaries, outperforming standard majority voting. Combined with tool integration and advanced filtering, these approaches are critical to state-of-the-art models’ competitive edge.
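To make the polynomial translation behind Wu-style geometry proving concrete (referenced in the first bullet above), the following sketch uses an invented toy configuration and an ideal-membership check via Gröbner bases; Wu’s method proper uses triangular forms, successive pseudo-division, and explicit non-degeneracy conditions, but the translation from geometry to polynomial algebra is the same:

```python
# Toy algebraic translation: right triangle with C = (0, 0), A = (a, 0),
# B = (0, b); M = (x, y) is the midpoint of AB. Claim: |MA| = |MC|.
# Hypotheses and conclusion become polynomials, and the conclusion must
# lie in the ideal generated by the hypotheses.
from sympy import groebner, symbols

a, b, x, y = symbols("a b x y")

hypotheses = [2 * x - a, 2 * y - b]                        # M is the midpoint of AB
conclusion = ((x - a) ** 2 + y ** 2) - (x ** 2 + y ** 2)   # |MA|^2 - |MC|^2

G = groebner(hypotheses, x, y, a, b, order="lex")
print(G.contains(conclusion))   # True -> the claim follows identically
```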
5. Evaluation, Metrics, and AIMO Competition Design
Evaluation of AIMO systems leverages a suite of strict and multifaceted metrics:
- Proof Validity: For formal systems (Isabelle, Lean), pass rates on problem sets (e.g., Pass@1, Pass@8) are measured via machine-checking; the standard estimator is sketched after this list. In Lean, state-of-the-art models achieve pass rates below 25% on Olympiad-level problems, and formal proof synthesis remains a substantial bottleneck (Zheng et al., 2021, Yousefzadeh et al., 28 Nov 2024).
- Functional Correctness: For tool-integrated and chain-of-thought systems, only solutions that yield the correct final numerical or symbolic answer are counted. Automatic grading combines symbolic computation tools (SymPy, Python), answer normalization, and, for open-ended or proof-based questions, manual verification; a minimal equivalence check is sketched at the end of this section.
- Token Efficiency: Recent research introduces mean token count as a metric for reasoning efficiency, optimized via reinforcement learning strategies such as GRPO (Yoshihara et al., 11 Jul 2025).
- Solution Selection: The use of generative selection models instead of majority voting yields further improvements in accuracy, especially in multi-candidate settings (Moshkov et al., 23 Apr 2025).
- Benchmark Competitions: The AIMO Progress Prize challenges are structured as strictly “leak-free” multi-round competitions. Models are evaluated on unseen, high-difficulty problems and required to output solutions in prescribed formats (e.g., boxed final answers, LaTeX formatting).
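For reference, the Pass@k figures above are typically reported with the standard unbiased estimator over n sampled attempts, of which c verify; the snippet below is a generic sketch rather than the exact harness used in any cited evaluation:

```python
# Unbiased pass@k estimator: probability that at least one of k attempts,
# drawn without replacement from n samples with c correct ones, succeeds.
# pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:                # every size-k draw contains a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 proof attempts per problem, 2 of which machine-check.
print(pass_at_k(n=8, c=2, k=1))   # 0.25
print(pass_at_k(n=8, c=2, k=8))   # 1.0
```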
A key outcome is that top models can now solve >30/50 problems on private Olympiad-style test sets, with pass@1 rates and efficiency metrics tracked alongside solution quality.
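A minimal sketch of the symbolic answer-equivalence check used in functional-correctness grading (the function and its normalization behavior are illustrative; production graders add LaTeX parsing, format normalization, and numeric fallbacks):

```python
# Check whether a predicted final answer matches the reference answer by
# parsing both with SymPy and testing whether their difference simplifies
# to zero. Unparseable outputs are treated as incorrect.
from sympy import simplify, sympify

def answers_match(predicted: str, reference: str) -> bool:
    try:
        difference = sympify(predicted) - sympify(reference)
        return simplify(difference) == 0
    except (SyntaxError, TypeError, ValueError):
        return False

print(answers_match("sqrt(8)", "2*sqrt(2)"))   # True
print(answers_match("100/3", "33"))            # False
```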
6. Challenges, Limitations, and Future Directions
Despite notable successes, significant challenges remain:
- Formalization Gaps: Automatic translation of informal proofs into fully formal ones remains limited, both by the scarcity of formal training data and by the difficulty of producing human-interpretable formal arguments; even advanced LLMs struggle to generate complete Lean proofs for hard IMO problems (Yousefzadeh et al., 28 Nov 2024).
- Generalization and Robustness: Models that perform well on curated or high-school–level datasets often falter on true Olympiad-level tasks, particularly those that require deep, multi-step chain-of-thought reasoning and symbolic manipulation (He et al., 21 Feb 2024, Fang et al., 26 Jun 2024).
- Human-AI Performance Disparities: Studies in mathematical reasoning across age and modality (e.g., SMART-840 for children’s Olympiads) indicate that AI models’ capabilities often diverge from human cognitive progression and foundational skills, especially in geometry and logic (Cherian et al., 22 Jun 2024).
- Language and Modality Adaptation: Recent work demonstrates progress on multilingual, cross-cultural datasets (e.g., Bangla Olympiad problems (Tabib et al., 8 Jan 2025)) and on multimodal (image-text) reasoning, but performance lags behind in many such domains.
- Data Scaling and Synthesis: New problem synthesis pipelines (PromptCoT) expand the Olympiad-level training corpus by orders of magnitude, but establishing meaningful evaluation and preventing data leakage in high-stakes competitions remains crucial (Zhao et al., 4 Mar 2025).
- Reproducibility and Open Science: Recent winning recipes emphasize open-sourcing code, models, and all training artifacts to enable rigorous verification and community contribution (Yoshihara et al., 11 Jul 2025).
Future research directions stressed in the literature include automatic problem generation and solution pairing, the integration of retrieval and code execution in multilingual contexts, formal decomposition methodologies for finer-grained diagnosis, and models that can both generate and select optimal solutions under strict competition conditions. There is also forward-looking interest in hybrid systems that can both propose problems and act as geometry coaches, as exemplified by TongGeometry (Zhang et al., 14 Dec 2024).
7. Impact and Broader Significance
AIMO research constitutes a watershed for AI-driven mathematical creativity, with implications far beyond competition. Advances in AIMO systems:
- Enable new strategies for formal verification, program synthesis, and symbolic computation.
- Serve as testbeds for AI systems with explainability, modularity, and interactive reasoning—key requirements in fields like software engineering, education, and scientific discovery.
- Underpin practical tools for mathematical education, offering training, automated feedback, and even personalized Olympiad coaching or problem proposal (Zhang et al., 14 Dec 2024).
The progression from dataset curation, formalization, and verification to integrated, competitive, and open-source AI systems positions AIMO as a central driver in the pursuit of genuinely general and creative artificial intelligence in mathematical domains.