IMO Grand Challenge

Updated 24 June 2026

IMO Grand Challenge is an initiative to develop AI systems that generate rigorous, human-level proofs for complex Mathematical Olympiad problems.
Core methodologies include iterative self-verification loops, decoupled reasoner–prover architectures, and neuro-symbolic geometry engines for robust problem solving.
Advances from this challenge are benchmarked against IMO problems, driving progress in automated theorem proving, formal verification, and AI mathematical reasoning.

The IMO Grand Challenge refers to the pursuit of building automated reasoning systems, specifically AI systems, that can solve International Mathematical Olympiad (IMO) problems at the level of a human gold medalist while producing verifiable solutions. The challenge has become a central benchmark for progress in machine mathematical reasoning, integrating advances in large language models (LLMs), formal theorem proving, neuro-symbolic methods, and underlying mathematical infrastructure (Huang et al., 21 Jul 2025, Liang et al., 7 Jul 2025, Sinha et al., 2024, Marić et al., 2020).

1. Definition and Motivation

The IMO Grand Challenge is to construct AI systems capable of solving, end-to-end, the full spectrum of IMO problems by delivering rigorous proofs that satisfy both human and formal-verification standards. Success on this benchmark implies significant advances in mathematical creativity, strategic lemma generation, proof formalization, and the orchestration of both informal and formal reasoning. The IMO is chosen as the gold standard because its problems are adversarially constructed to elude rote tactics, requiring deep insight and multi-step argumentation; even the strongest open-source provers have historically failed on many such instances (Liang et al., 7 Jul 2025).

Motivation arises from three directions:

IMO problems outstrip standard mathematical datasets (e.g., AIME, Putnam) in both combinatorial complexity and ingenuity required (Huang et al., 21 Jul 2025).
A substantial gap separates informal solution generation, where LLMs exceed 80% accuracy, from verified formal proofs, where success stays below 8%; this gap motivates hybrid or modular architectures (Liang et al., 7 Jul 2025).
Progress on this challenge directly maps onto advances in automated theorem proving, interactive proof assistants, and long-context neuro-symbolic modeling.

2. Core Technical Approaches

Developments in the IMO Grand Challenge have centered on three paradigms:

a. Self-Verification LLM Pipelines

Pioneered in "Gemini 2.5 Pro Capable of Winning Gold at IMO 2025" (Huang et al., 21 Jul 2025), the approach is built on an iterative generate–verify–revise loop:

Initial Solution Generation: The LLM (e.g., Gemini 2.5 Pro) is prompted as an expert solver to produce a method sketch and a detailed TeX-formatted proof under a 32K-token budget.
Self-Improvement: The model critiques its own partial draft and continues the proof, effectively doubling the available reasoning budget.
Automated Verification: A distinct prompt configures the model as an IMO grader to classify each line as "OK," "Justification Gap," or "Critical Error."
Human-in-the-Loop (Optional): External review filters false positives/negatives in the bug report.
Correction: The solver is re-prompted with the bug report and asked to fix all detected issues.
Acceptance Criterion: A proof is accepted only after it passes the verifier five consecutive times without issue; otherwise the solution attempt is abandoned.

The pipeline is explicitly designed to separate solution generation from verification. Key prompt snippets enforce "rigor is paramount," transparent reporting of completeness, and formal output structuring (method sketch + detailed proof). The verifier prompt strictly prohibits the model from performing corrections during verification, enforcing a grader–solver separation.
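
The control flow of the loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Report`, `run_pipeline`, `solve`, `verify`, and `revise` are hypothetical, the three callables stand in for prompted model calls, and the `max_rounds` budget is an assumption (the source states only that failed attempts are abandoned). The self-improvement and human-review steps are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Report:
    """Hypothetical bug report: one entry per flagged step."""
    issues: List[str]  # e.g. "Justification Gap: ..." or "Critical Error: ..."

    @property
    def clean(self) -> bool:
        return not self.issues


def run_pipeline(
    problem: str,
    solve: Callable[[str], str],
    verify: Callable[[str, str], Report],
    revise: Callable[[str, str, Report], str],
    required_passes: int = 5,   # five consecutive clean passes, as in the text
    max_rounds: int = 30,       # assumed budget, not taken from the source
) -> Optional[str]:
    """Return a proof that passed `required_passes` consecutive checks, else None."""
    proof = solve(problem)
    streak = 0
    for _ in range(max_rounds):
        report = verify(problem, proof)  # the verifier never edits the proof
        if report.clean:
            streak += 1
            if streak >= required_passes:
                return proof
        else:
            streak = 0                   # any flagged issue resets the streak
            proof = revise(problem, proof, report)
    return None                          # attempt abandoned
```

Passing the solver and verifier as separate callables mirrors the grader–solver separation: the verifier only returns a report, and all edits happen in `revise`.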

b. Decoupled Reasoner–Prover Architectures

A contrasting, modular strategy is advanced in "Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving" (Liang et al., 7 Jul 2025). Here, the system is split into:

Reasoner: An LLM generates a strategic breakdown of the main theorem into explicit lemma candidates stated in a formal language (e.g., Lean 4), guided by proof-planning prompts.
Prover: A specialized formal prover attempts to verify each lemma (up to specified resource limits); only verified lemmas are retained.
Proof Assembly: The final proof of the main theorem is constructed by invoking previously verified subgoals.

This division mitigates the known failures of end-to-end training, which tends to penalize deep, creative reasoning in favor of short, tactic-based solutions.
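
The three stages above amount to a filter-then-assemble pattern, sketched below. The function and parameter names (`decoupled_prove`, `propose_lemmas`, `try_prove`, `assemble`) are hypothetical placeholders for the reasoner, the formal prover, and the final assembly step; the sketch assumes the prover returns `None` when it fails or exhausts its resource limit.

```python
from typing import Callable, Dict, List, Optional


def decoupled_prove(
    theorem: str,
    propose_lemmas: Callable[[str], List[str]],            # reasoner (assumed interface)
    try_prove: Callable[[str], Optional[str]],             # prover; None on failure
    assemble: Callable[[str, Dict[str, str]], Optional[str]],
) -> Optional[str]:
    """Keep only lemmas the prover verifies, then build the main proof from them."""
    verified: Dict[str, str] = {}
    for statement in propose_lemmas(theorem):
        proof = try_prove(statement)
        if proof is not None:          # unverified candidates are discarded
            verified[statement] = proof
    return assemble(theorem, verified)
```

Because the reasoner is never scored on whether its own output type-checks, it is free to propose ambitious lemmas; the prover acts purely as a filter.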

c. Symbolic and Neuro-Symbolic Geometry Engines

For Euclidean geometry, high performance has been achieved via:

Wu's Method: An algebraic elimination scheme converting geometric constraints into polynomial ideal membership tests using Ritt–Wu pseudo-division and triangularization (Sinha et al., 2024).
Classical Synthetic Methods: Deductive databases and angle/ratio/distance chasing rules, executed by forward-chaining.
Neuro-Symbolic Integration: Most notably via AlphaGeometry, which combines a transformer LLM for construction suggestion with a robust symbolic backend, trained on 100 million synthetic problems. Ensembles of Wu's method and AlphaGeometry achieve superior coverage (Sinha et al., 2024).
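
To make the forward-chaining idea concrete, the following sketch saturates a tiny deductive database over lines in the plane. The predicate names (`para`, `perp`), the three rules, and the line labels are illustrative assumptions; real geometry engines use far larger rule sets and richer facts.

```python
from itertools import product
from typing import Set, Tuple

Fact = Tuple[str, str, str]  # (predicate, line, line); predicate is "para" or "perp"


def step(facts: Set[Fact]) -> Set[Fact]:
    """Apply every rule once and return only the facts that are new."""
    new: Set[Fact] = set()
    for p, a, b in facts:
        new.add((p, b, a))                       # both predicates are symmetric
    for (p, a, b), (q, c, d) in product(facts, repeat=2):
        if b != c or a == d:
            continue                             # need a shared middle line
        if p == "para" and q == "para":
            new.add(("para", a, d))              # a || b, b || d  =>  a || d
        elif p == "perp" and q == "perp":
            new.add(("para", a, d))              # a _|_ b, b _|_ d  =>  a || d
        else:
            new.add(("perp", a, d))              # one || and one _|_  =>  a _|_ d
    return new - facts


def saturate(facts: Set[Fact]) -> Set[Fact]:
    """Forward-chain until a fixed point is reached."""
    closed = set(facts)
    while True:
        new = step(closed)
        if not new:
            return closed
        closed |= new


premises = {("para", "AB", "CD"), ("para", "CD", "EF"), ("perp", "EF", "GH")}
closure = saturate(premises)
print(("perp", "AB", "GH") in closure)
```

A goal is proved when it appears in the closure. When saturation stalls without reaching the goal, a neuro-symbolic system asks the language model for an auxiliary construction and saturates again.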

3. Current Benchmarks and Quantitative Metrics

Recent work quantifies progress using specific datasets and success criteria:

2025 IMO Problems: Gemini 2.5 Pro, run through the self-verification pipeline, solves 5 out of 6 problems, each passing the verifier five times consecutively, for a success rate of roughly 83% on uncontaminated, post-release test data (Huang et al., 21 Jul 2025). Earlier public LLMs had not exceeded three to four correct solutions.
IMO-AG-30 Geometry Benchmark: Ensembling Wu's method and AlphaGeometry yields 27/30 solved instances, surpassing the average gold medalist score of 25.9/30 (Sinha et al., 2024). Wu's method alone, surprisingly, reaches 15/30.
Formal Non-Geometry Problems (2000-2024): The decoupled Reasoner–Prover pipeline solves 5 out of 100, making the first open-source advances into IMO non-geometry via formal proofs (Liang et al., 7 Jul 2025).
Human Baselines: Average silver and gold medalist scores on the geometry problem set are reported as 22.9/30 and 25.9/30, respectively, and “fully verified” machine solutions are compared directly against these thresholds.
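
The headline figures above reduce to simple ratios. The snippet below recomputes them from the counts quoted in this section; the dictionary keys and helper names are illustrative only.

```python
# (solved, total) pairs as quoted in this section
results = {
    "IMO 2025, self-verification pipeline": (5, 6),
    "IMO-AG-30, Wu + AlphaGeometry ensemble": (27, 30),
    "IMO-AG-30, Wu's method alone": (15, 30),
    "Formal non-geometry 2000-2024, decoupled pipeline": (5, 100),
}

# Average human scores out of 30 on the geometry set
GOLD_AVG = 25.9
SILVER_AVG = 22.9


def rate(solved: int, total: int) -> float:
    """Success rate as a percentage."""
    return 100.0 * solved / total


def beats(solved: int, baseline: float) -> bool:
    """True if a solved count exceeds a human average on the same set."""
    return solved > baseline


for name, (solved, total) in results.items():
    print(f"{name}: {solved}/{total} = {rate(solved, total):.1f}%")
```

On these numbers the ensemble's 27/30 clears the gold average of 25.9, while Wu's method alone at 15/30 stays below the silver average of 22.9.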

4. Representative Workflows and Problem Types

IMO Grand Challenge research addresses the diverse styles of IMO problems:

Algebra and Functional Equations: Modular proof-by-induction routines, chain-of-equality manipulations, and auxiliary lemmas (e.g., sum reindexing, recurrences) dominate. Formalizations in Isabelle/HOL and Lean 4 are public, offering many mechanization blueprints (Marić et al., 2020).
Combinatorics: Colorings, tilings, and cardinality arguments require finite set manipulations, sums over intervals, and case analyses, often supported by custom helper lemmas to bridge low-level gaps.
Number Theory: Recursions, valuation arguments, and extremal constructions feature in both LLM-based and prover-based discoveries.
Geometry: Algebraic encodings (via coordinate assignments and polynomialization), integration of synthetic methods, and angle/ratio chasing. Wu's method and AlphaGeometry underpin state-of-the-art performance, but interpretability and coverage remain open targets (Sinha et al., 2024).
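
As a small worked illustration of the algebraic encoding (an example chosen here, not one taken from the cited papers), consider a right triangle with A = (0, 0), B = (u_1, 0), C = (0, u_2) and let M = (x_1, x_2) be the midpoint of BC. The hypotheses h_1, h_2 encode the midpoint condition, and the conclusion g encodes MA^2 = MB^2. Pseudo-dividing g successively by the hypotheses leaves remainder zero, which is the certificate that Wu's method produces:

```latex
\begin{align*}
h_1 &= 2x_1 - u_1, \qquad h_2 = 2x_2 - u_2,\\
g &= \bigl(x_1^2 + x_2^2\bigr) - \bigl((x_1 - u_1)^2 + x_2^2\bigr) = 2u_1 x_1 - u_1^2,\\
\operatorname{prem}(g, h_2, x_2) &= g \quad (\text{$g$ does not involve } x_2),\\
2\,g &= 2u_1\,(2x_1 - u_1) + 0 \;\Longrightarrow\; \operatorname{prem}(g, h_1, x_1) = 0.
\end{align*}
```

Here the leading coefficients of h_1 and h_2 are constants, so no non-degeneracy conditions arise; in general they are polynomials in the free parameters u_i and must be assumed nonzero.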

Each solution pipeline is engineered to expose gaps, whether false claims or missing or incomplete justifications, using tool-specific constructs (e.g., Isar proof scripts in Isabelle/HOL; theorem declarations plus “by sorry” stubs in Lean; LaTeX method sketches in LLM outputs).
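
For readers unfamiliar with the Lean convention, the following sketch shows what such stubs look like in Lean 4. The lemma and theorem are hypothetical examples, not drawn from any of the cited repositories: each statement type-checks, while `sorry` marks a proof obligation that is still open and is reported as a warning by the compiler.

```lean
namespace IMOSketch

-- Hypothetical lemma candidate: the statement is checked, the proof is a stub.
theorem sq_diff_nonneg (a b : Int) : 0 ≤ (a - b) * (a - b) := by
  sorry

-- Main goal that would invoke the lemma once both proofs are filled in.
theorem two_mul_le_sq_add_sq (a b : Int) : 2 * a * b ≤ a * a + b * b := by
  have h := sq_diff_nonneg a b
  sorry

end IMOSketch
```

Every remaining `sorry` is an explicit, machine-visible gap, which makes it straightforward to count how much of a decomposition has actually been verified.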

5. Datasets, Formalization Efforts, and Toolchains

Public artifacts and tools emerging from IMO Grand Challenge initiatives include:

Belgrade IMO Isabelle/HOL Repository: A systematically organized set of formalizations, helper theories, and mechanized proofs, covering algebra, combinatorics, and number theory. The repository provides a granular corpus of solution strategies and structures, supporting both manual and automated tactic suggestion (Marić et al., 2020).
Tencent IMO Lemma Dataset: Over 1,200 verified lemma statements for 100 IMO problems, with JSON metadata including subgoal declarations and Lean proof scripts, supporting studies of modular decomposition and transfer across problems (Liang et al., 7 Jul 2025).
Synthetic Training Sets for Neuro-Symbolic Models: AlphaGeometry is pre-trained and fine-tuned on roughly 100 million synthetic geometry problems paired with known proof traces, bootstrapping large-scale LLM-based construction suggestion (Sinha et al., 2024).

These resources not only benchmark performance but provide a laboratory for transfer learning, routine abstraction, and proof-search algorithm development.

6. Open Challenges and Future Directions

Ongoing efforts in the programmatic solution of IMO problems highlight unresolved technical fronts:

Reasoning-Token Bottleneck: Current LLMs can barely fit a single complete IMO proof in available context; iterative extension (“self-improvement passes”) is an ad hoc workaround (Huang et al., 21 Jul 2025).
Verifier Calibration: Automated verifiers may misclassify non-crucial “gaps,” generating excessive editing overhead, while occasionally missing deep errors (Huang et al., 21 Jul 2025).
Multi-Agent and Ensemble Methods: Richer pipelines, exploiting model diversity (e.g., “Grok 4 Heavy–style” voting, AlphaGeometry ∪ Wu), are expected to add robustness and coverage (Sinha et al., 2024).
Formalization of Geometry: Open-source proof assistants (e.g., Isabelle/HOL) still lack competitive synthetic geometry libraries. Extending formal backends to handle all problem types remains a major step (Marić et al., 2020).
Automated Sub-Lemma Discovery and Reuse: Tools are needed that break unprovable lemmas into solvable fragments, enabling deeper recursion in proof planning (Liang et al., 7 Jul 2025).
Human-Interpretability: The path from algebraic or neuro-symbolic certificates to human-readable arguments, especially in geometry, remains a desideratum (Sinha et al., 2024).
Adversarial Generalization: Controlled extension beyond the Olympiad style—to adversarial and open-ended domains—is necessary to avoid overfitting and to advance general AI mathematical reasoning.

A plausible implication is that success will require integrated, modular architectures that couple high-level insight generation with efficient, formally verifiable low-level proving components, supported by rich, reusable lemma libraries and formal representations at all levels.

7. Significance and Outlook

The IMO Grand Challenge has crystallized as a touchstone for AI and automated reasoning. It motivates research in prompt engineering, modular pipelines, formalization, and neuro-symbolic integration. Progress has been rapid and measurable: the gap between AI and gold-medalist human performance is now closing in geometry and, increasingly, in combinatorics, algebra, and number theory. The release of public datasets and formal proof corpora accelerates collective advancement and enables robust benchmarking. Despite these advances, fully end-to-end IMO problem solving, particularly for the hardest problems and under formal verification, remains an unsolved problem. Its pursuit continues to drive fundamental research at the intersection of AI, mathematics, and formal methods (Huang et al., 21 Jul 2025, Liang et al., 7 Jul 2025, Marić et al., 2020, Sinha et al., 2024).