Formalization of IMO Problems
- Formalization of IMO problems is the process of converting informal, natural-language math challenges into rigorously defined, machine-verifiable proofs using theorem provers.
- It involves translating diverse domains like algebra, geometry, and combinatorics into formal languages such as Lean, Isabelle/HOL, and custom DSLs to ensure precise representation.
- Human-AI hybrid workflows and lemma decomposition raise formalization and proof success rates and support benchmarking, though quantifier handling and limited library coverage remain obstacles.
The formalization of International Mathematical Olympiad (IMO) problems refers to the rigorous encoding of IMO-level mathematical statements and proofs within formal logical systems, typically using interactive or automated theorem provers. This process transforms inherently informal, natural-language problem statements and human-written arguments into syntactically precise, machine-verifiable objects suitable for automated reasoning, corpus construction, and AI research. The challenge is multi-faceted: it involves translating intricate and diverse mathematical domains (algebra, number theory, geometry, combinatorics) into structured formalisms, engineering supporting libraries and notations, designing benchmarks for evaluating both human and AI solvers, and architecting workflows that accommodate both manual and automated contributions. Over the last several years, a suite of datasets, frameworks, and evaluation harnesses has become an essential resource for benchmarking and advancing the state of formalized mathematical reasoning at the Olympiad level.
1. Problem Selection and Domain Coverage
The selection of IMO problems for formalization typically spans various sources and mathematical domains, with datasets focusing on shortlist problems, official contests, or handpicked benchmarks. Notable efforts include:
- FIMO formalizes 149 algebra and number theory problems from the IMO Shortlist (2006–2021), achieving a 60.8% success rate for machine-assisted formalization (71.8% in algebra, 49.6% in number theory) (Liu et al., 2023).
- CombiBench covers combinatorics, including all IMO combinatorial problems since 2000 (excluding image-based ones), formalized in Lean 4 (Liu et al., 6 May 2025).
- LeanGeo spans geometry, incorporating all 43 IMO geometry problems since 2000 into Lean 4 (Song et al., 20 Aug 2025).
- miniF2F provides a cross-system corpus with 40 IMO problems as part of a 488-problem set, covering algebra, number theory, and inequalities, suitable for Metamath, Lean, Isabelle, and HOL Light (Zheng et al., 2021).
- FormalGeo encodes a smaller number (currently 18) of IMO geometry problems in a custom formal system, expandable via augmentation (Zhang et al., 2023).
- Lean-IMO Datasets (e.g., "Small Steps…" (Yousefzadeh et al., 2024)) provide detailed formalizations and lemma decompositions for all 20 miniF2F IMO test problems and selected recent IMOs.
Early datasets often omit classical geometry and advanced combinatorics because of insufficient library support, though recent frameworks (LeanGeo, AlphaGeometry2, FormalGeo) now admit a much broader range, including locus problems, movement, and non-constructive statements (Chervonyi et al., 5 Feb 2025, Zhang et al., 2023).
2. Formalization Languages and Representational Strategies
The predominant target languages are Lean (both v3+mathlib and v4+mathlib4), Isabelle/HOL, and custom DSLs for geometry.
- Lean (v3/v4): Theorems are encoded with explicit parameter lists, hypotheses as assumptions, and conclusions as goals. Common imports include real numbers, sets, finsets (for combinatorics), and algebraic and geometric modules. Managed codebases often use naming conventions for easy cross-reference with public competition archives and mathlib (Liu et al., 2023, Liu et al., 6 May 2025, Song et al., 20 Aug 2025, Yousefzadeh et al., 2024).
- Isabelle/HOL: Statements utilize nat, int, real, and set types with Isar-style block-structured proofs and abundant use of automation (e.g., sledgehammer, metis) (Marić et al., 2020).
- Custom Geometry DSLs: Both AlphaGeometry2 and FormalGeo introduce expressive syntaxes with primitive predicates (collinearity, concyclicity, equal angles, ratios, locus queries). AlphaGeometry2 offers 88 predicates and 196 theorems, balancing succinctness and coverage (Chervonyi et al., 5 Feb 2025, Zhang et al., 2023).
- Hybrid, Multi-Assistant Benchmarks: miniF2F and RIMO coordinate formalizations across Metamath, Lean, Isabelle, and HOL Light to ensure cross-system comparability and stress-test formal reasoning capabilities (Zheng et al., 2021, Chen et al., 9 Sep 2025).
Representative encodings often closely mirror the original LaTeX, but always isolate all parameters, hypotheses, and quantifiers. For answer-seeking (classification) problems, goals are reframed as uniqueness or sum-formulas to admit single-valued outputs, as seen in RIMO-N (Chen et al., 9 Sep 2025).
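As an illustration of the statement shape described above, the following Lean 4 sketch encodes IMO 1959 Problem 1 (the fraction (21n+4)/(14n+3) is irreducible) with the parameter, a hypothesis, and the goal kept separate. The theorem names, the positivity hypothesis `h₀`, and the second toy "determine all" statement are illustrative assumptions rather than excerpts from any of the cited datasets. The set-equality form is one common convention for answer-seeking problems, not necessarily the one RIMO-N uses. Proofs are left as `sorry` because only the statements are of interest here.

```lean
import Mathlib

/-- IMO 1959 P1, statement only: parameter `n`, hypothesis `h₀`, goal after the colon. -/
theorem imo_1959_p1 (n : ℕ) (h₀ : 0 < n) :
    Nat.gcd (21 * n + 4) (14 * n + 3) = 1 := by
  sorry

/-- Toy answer-seeking problem ("find all positive real x with x² = 4"),
    reframed as a single-valued claim about the solution set. -/
theorem toy_determine : {x : ℝ | x ^ 2 = 4 ∧ 0 < x} = {2} := by
  sorry
```

Keeping every hypothesis as a named binder makes it easier to compare the formal statement line by line against the informal problem text.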
3. Methodologies for Formalization: Human-in-the-Loop, Automation, and Decomposition
Formalization methodologies span the spectrum from fully manual to hybrid human–AI pipelines:
- Manual Formalization: Practitioners encode problem statements and proofs stepwise, as in miniF2F, Isabelle/HOL-IMO, and the "Small Steps..." dataset (Zheng et al., 2021, Marić et al., 2020, Yousefzadeh et al., 2024). This ensures semantic fidelity but is labor-intensive.
- Semi-Automatic with Reflection: FIMO demonstrates an iterative "autoformalization-with-reflection" workflow: GPT-4 proposes Lean statements, Lean is invoked to check/suggest error corrections, and humans verify semantic alignment. Up to five correction rounds boost success rates from 32.6% to 60.8% (Liu et al., 2023).
- Decomposition into Lemmas: Several benchmarks systematically break down full proofs into networks of intermediate lemmas:
  - RIMO-P decomposes each proof into 1-4 logically ordered subproblems, enabling stepwise grading (Chen et al., 9 Sep 2025).
  - The "Small Steps..." dataset extracts and curates 1,329 nontrivial lemmas from full IMO proofs, explicitly avoiding trivial (single-tactic) steps (Yousefzadeh et al., 2024).
  - LEAP and Aristotle leverage blueprint/lemma planning, where informal arguments scaffold sketches of formal statements with auxiliary lemmas inserted as formal subgoals (Kung et al., 2 Jun 2026, Achim et al., 1 Oct 2025).
- Geometry-Specific Engineering: AlphaGeometry2, FormalGeo, and LeanGeo implement domain-specific logics with hundreds of high-level predicates and theorems. They express constructions, metric relations, and non-constructive constraints (ratios, movement) through concise, extensible DSLs (Chervonyi et al., 5 Feb 2025, Zhang et al., 2023, Song et al., 20 Aug 2025).
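The propose-check-correct cycle behind the semi-automatic workflow in the list above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not FIMO's actual pipeline: `propose` stands in for an LLM call, `check` stands in for invoking the proof assistant, and both toy implementations below are hypothetical. The human check of semantic alignment happens outside this loop.

```python
from typing import Callable, Optional, Tuple


def formalize_with_reflection(
    problem: str,
    propose: Callable[[str, Optional[str], Optional[str]], str],
    check: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Return (accepted statement or None, number of rounds used)."""
    candidate: Optional[str] = None
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        # The proposer sees the previous draft and the checker's error message.
        candidate = propose(problem, candidate, feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate, round_no
    return None, max_rounds


# Hypothetical stand-ins so the sketch runs without an LLM or a Lean install.
def toy_propose(problem: str, previous: Optional[str], feedback: Optional[str]) -> str:
    if previous is None:
        # First draft contains a deliberate typo in the identifier.
        return "theorem t (n : Nat) : Nat.gcdd (21*n+4) (14*n+3) = 1 := by sorry"
    return previous.replace("Nat.gcdd", "Nat.gcd")


def toy_check(candidate: str) -> Tuple[bool, str]:
    if "Nat.gcdd" in candidate:
        return False, "unknown constant 'Nat.gcdd'"
    return True, ""


statement, rounds = formalize_with_reflection(
    "Show that (21n+4)/(14n+3) is irreducible.", toy_propose, toy_check
)
```

A statement accepted by the checker is only syntactically and type-theoretically valid, so a separate review is still needed to confirm that it says what the informal problem says.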
Typical formalization difficulties arise from quantifier misplacement, implicit/informal assumptions in original phrasing, library limitations for specialized constructs, and subtle mismatches between formal and informal semantics.
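Quantifier misplacement is easy to illustrate with a toy pair of Lean 4 statements that are not taken from any cited benchmark. The two propositions below use the same symbols and differ only in quantifier order, yet the first is provable and the second is refutable.

```lean
-- Every natural number has a strictly larger one: true.
example : ∀ n : Nat, ∃ m : Nat, n < m :=
  fun n => ⟨n + 1, Nat.lt_succ_self n⟩

-- Swapping the quantifiers asserts one m above every n: false.
example : ¬ ∃ m : Nat, ∀ n : Nat, n < m :=
  fun ⟨m, h⟩ => Nat.lt_irrefl m (h m)
```

Both statements type-check, so a compiler-only check cannot detect that the second is a mistranslation of the first.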
4. Benchmarking, Evaluation Protocols, and Automated Grading
Evaluation strategies aim to ensure rigor, reproducibility, and diagnostic clarity:
- Exact-Match, Fill-in-the-Blank, and Deterministic Grading: RIMO-N uses integer-only answer formats, so solution correctness is established by O(1) string matching; CombiBench's Fine-Eval protocol demands sorry-free Lean files compiling to specified values, with optional normalization for mathematically equivalent answers (Chen et al., 9 Sep 2025, Liu et al., 6 May 2025).
- Stepwise and Subproblem Evaluation: RIMO-P introduces an LLM-based, JSON-formatted judge for subproblem chains; the model can only progress if all prior steps are deemed correct (Chen et al., 9 Sep 2025). Similar decomposition enables fine-grained diagnosis of LLM failures in the "Small Steps..." and LEAP frameworks (Yousefzadeh et al., 2024, Kung et al., 2 Jun 2026).
- Cross-System and Multi-Language Evaluation: miniF2F aligns statements across four theorem provers, with baseline metrics like Pass@N (fraction solved after N attempts), average proof length, and per-category breakdowns (Zheng et al., 2021).
- Agentic and Self-Refining Search: LEAP maintains an AND-OR DAG of goal decompositions, employing the Lean compiler as a search oracle and an LLM reviewer to regulate subgoal proposals and discourage unhelpful lemma restatements (Kung et al., 2 Jun 2026).
- Geometry Benchmarks: AlphaGeometry2, LeanGeo, and FormalGeo assess both coverage (percentage of IMO geometry problems expressible in their systems—88% for AlphaGeometry2) and solving success rates (AlphaGeometry2: 84% on IMO geometry, LeanGeo: near-zero for the hardest Olympiad problems with current LLMs) (Chervonyi et al., 5 Feb 2025, Song et al., 20 Aug 2025).
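Three of the protocols listed above reduce to very small computations: exact-match grading, gated stepwise scoring, and Pass@N. The Python sketch below shows one possible reading of each. The function names, the whitespace normalization, and the judge callback are illustrative assumptions and are not drawn from the RIMO, CombiBench, or miniF2F codebases.

```python
from typing import Callable, Sequence


def exact_match(predicted: str, expected: int) -> bool:
    """Integer-only grading: strip surrounding whitespace, then compare strings."""
    return predicted.strip() == str(expected)


def gated_score(steps: Sequence[str], judge: Callable[[int, str], bool]) -> int:
    """Count steps accepted in order; stop at the first step the judge rejects."""
    passed = 0
    for index, step in enumerate(steps):
        if not judge(index, step):
            break  # later steps are not graded once one step fails
        passed += 1
    return passed


def pass_at_n(outcomes: Sequence[Sequence[bool]], n: int) -> float:
    """Fraction of problems with at least one success in the first n attempts."""
    if not outcomes:
        return 0.0
    solved = sum(1 for attempts in outcomes if any(attempts[:n]))
    return solved / len(outcomes)


# Toy data: three problems with recorded attempt outcomes.
attempt_log = [[False, True], [False, False], [True]]
rate_1 = pass_at_n(attempt_log, 1)
rate_2 = pass_at_n(attempt_log, 2)
```

In `gated_score`, a correct later step earns no credit after an earlier failure, which matches the rule that a model can only progress if all prior steps are deemed correct.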
5. Technical, Architectural, and Methodological Innovations
Several engineering and methodology themes recur across successful formalization efforts:
- Feedback-Reflective Loops: Iterative interaction between LLMs and proof assistants (error-correction cycles) substantially increases the yield of machine-consistent formalizations (Liu et al., 2023).
- Lemma-Based Proof Search: Incorporating autoformalized, human-suggested, or LM-extracted lemmas dramatically improves accessibility, success rates, and proof search tractability over single-stage, tactic-centric approaches (Achim et al., 1 Oct 2025, Kung et al., 2 Jun 2026, Liang et al., 7 Jul 2025).
- Blueprint Scaffolding and AND-OR Structures: Hierarchical memoization and anticipatory lemma planning—exemplified by LEAP—avoid exponential explosion in DFS-style search and can be coupled tightly to proof assistants' feedback or external reviewers (Kung et al., 2 Jun 2026).
- Geometry DSLs and Engineered Solvers: For geometry, domain-specific languages (AlphaGeometry2, FormalGeo, LeanGeo) encode constructions, constraints, and determination theorems, closely mirroring natural geometric argumentation and supporting non-constructive, locus-based, and metric relations (Chervonyi et al., 5 Feb 2025, Zhang et al., 2023, Song et al., 20 Aug 2025).
- Benchmark Construction and Ground-Truth Cross-Verification: RIMO and Lean-IMO-Bench utilize expert-checked or community-validated ground truth for both answer and proof formats, ensuring reliable and reproducible evaluation (Chen et al., 9 Sep 2025, Kung et al., 2 Jun 2026).
- Extensibility and Data Augmentation: Frameworks such as FormalGeo and AlphaGeometry2 support seamless addition of new predicates and theorems, facilitating broadening of coverage and finer stratification by difficulty or topic (Zhang et al., 2023, Chervonyi et al., 5 Feb 2025).
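The AND-OR structure with memoization mentioned in the list above can be illustrated with a small recursive search. This is a generic sketch, not LEAP's implementation. It assumes the decomposition graph is acyclic, and the goal names and the `proves_directly` callback (a stand-in for a proof-assistant call) are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Goal = str
# For each goal: a list of alternative decompositions (OR),
# each of which is a list of subgoals that must all hold (AND).
Decompositions = Dict[Goal, List[List[Goal]]]


def solve(
    goal: Goal,
    decompositions: Decompositions,
    proves_directly: Callable[[Goal], bool],
    memo: Optional[Dict[Goal, bool]] = None,
) -> bool:
    """Return True if the goal is provable directly or via some decomposition."""
    if memo is None:
        memo = {}
    if goal in memo:
        return memo[goal]  # shared subgoals are solved only once
    solved = proves_directly(goal) or any(
        all(solve(sub, decompositions, proves_directly, memo) for sub in subgoals)
        for subgoals in decompositions.get(goal, [])
    )
    memo[goal] = solved
    return solved


# Toy instance: two lemmas share one subgoal, and a second plan is never needed.
toy_decompositions: Decompositions = {
    "main": [["lemma_a", "lemma_b"], ["lemma_c"]],
    "lemma_a": [["shared"]],
    "lemma_b": [["shared"]],
}
prover_calls: List[Goal] = []


def toy_prover(goal: Goal) -> bool:
    prover_calls.append(goal)  # record each direct proof attempt
    return goal == "shared"


main_solved = solve("main", toy_decompositions, toy_prover)
```

Because results are cached per goal, the shared subgoal reaches the prover once even though two lemmas depend on it. The alternative plan through `lemma_c` is never explored because the first plan succeeds.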
6. Persistent Challenges, Limitations, and Future Directions
Challenges persist on several fronts:
- Library Maturity and Expressivity: Many IMO geometry and combinatorics problems remain outside reach of general-purpose theorem prover libraries, requiring continual extension (especially for advanced combinatorial objects, geometric transformations, and movement/locus queries) (Liu et al., 2023, Liu et al., 6 May 2025, Chervonyi et al., 5 Feb 2025).
- Quantifier Handling and Natural-Language Drift: Autoformalization remains brittle to quantifier order, implicit structural constraints, and subtle distinctions between existence, uniqueness, and classification problems (Liu et al., 2023, Chen et al., 9 Sep 2025).
- Proof Construction Gaps: LLMs, even with blueprint or lemma guidance, often misapply tactics, introduce mathematically invalid arguments, or conflate similar hypotheses, especially on the most challenging problems (e.g., recent IMO shortlist or combinatorial geometry) (Liu et al., 2023, Song et al., 20 Aug 2025, Yousefzadeh et al., 2024).
- Deterministic Scoring vs. Expressive Goals: Answer-based benchmarks (e.g., RIMO-N) trade off mathematical richness for deterministic evaluation, while proof-based approaches require robust and objective stepwise judging (Chen et al., 9 Sep 2025).
- Geometry-Specific Tactics: Embedding SMT solvers in geometry proving (as in LeanGeo) improves automation in some cases, but leveraging area-method, coordinate, or algebraic approaches remains an open engineering problem (Song et al., 20 Aug 2025).
- Automated Lemma Utilization: Provers often fail to call upon pre-proved lemmas unless specifically guided or fine-tuned for context sensitivity, leading to search inefficiency and missed integrations (Liang et al., 7 Jul 2025).
Planned extensions include expanded coverage within geometry and combinatorics, cross-assistant pipelines for Lean, Isabelle, and Coq, tighter integration with tactic-guidance tools (TacticToe, Sledgehammer, library_search), and dynamic embedding of formal benchmarks into neural proof search agents (Liu et al., 2023, Kung et al., 2 Jun 2026).
7. Impact, Resources, and Community Practices
The systematic formalization of IMO problems accelerates research into AI-based mathematical reasoning and theorem proving, underpins new agentic architectures, and provides public benchmarks essential for replicable progress and fair comparison. Publicly available datasets (FIMO, miniF2F, RIMO, LeanGeo-Bench, CombiBench, formalgeo7k, IMO-Steps) serve both as evaluation suites and as foundational corpora for training, fine-tuning, and ablation studies.
Key best practices, distilled from current work (Yousefzadeh et al., 2024), include decomposing proofs into nontrivial lemma units, maintaining explicit datasheets for topics and difficulty, careful library management and versioning, and the use of rigorous, publication-grade verification and grading pipelines. The formalization of IMO problems thus constitutes both a technical and community-driven bridge from natural, creative mathematical problem solving to fully machine-verifiable, extensible, and scalable mathematical knowledge systems.