Lean-IMO-Bench

Updated 7 June 2026

Lean-IMO-Bench is a benchmark suite featuring formalized IMO-style problems in Lean 4, emphasizing non-routine, multi-step proof synthesis across diverse mathematical domains.
It categorizes problems into basic and advanced splits, with tasks designed to test complex reasoning in algebra, combinatorics, number theory, and geometry.
It supports both one-shot (pass@N) and iterative (rollout) evaluation, and the solve-rate gains reported for LEAP over prior provers on both splits illustrate its use in automated formal theorem proving research.

Lean-IMO-Bench is a benchmark suite comprising a rigorously formalized collection of International Mathematical Olympiad (IMO)-style problems in Lean 4, designed to evaluate and advance automated formal theorem proving systems, particularly LLMs and agentic frameworks. It differs from prior efforts by focusing on problems whose statements are short but whose solutions require non-routine, multi-step reasoning, so that success depends on proof synthesis comparable to what the IMO demands rather than on routine, single-step tasks (Kung et al., 2 Jun 2026).

1. Motivation and Rationale

Existing formal theorem proving benchmarks, including MiniF2F and PutnamBench, have reached a point of saturation, with most state-of-the-art systems succeeding on their “routine” subsets. These benchmarks fall short in modeling IMO-style mathematics: problems with short statements that nonetheless require deep, non-standard insights and long proofs across algebra, combinatorics, number theory, and geometry. Lean-IMO-Bench was developed to address this gap by providing a formally verified, diverse problem set that measures how well theorem provers and LLMs handle non-routine reasoning and long proof scripts. Its problems demand the decomposition, insight, and synthesis typical of real olympiad challenges, which makes it suited to benchmarking both general-purpose reasoning agents and specialized provers (Kung et al., 2 Jun 2026).

2. Construction and Content

The foundation of Lean-IMO-Bench is IMO-Bench (EMNLP 2025), which aggregated 60 IMO-style problems vetted by former IMO medalists. Problems are organized into two balanced splits of 30 each, differentiated by complexity:

Basic Split (30 problems): Each problem is solvable with classical “B”-level IMO reasoning in 3–5 conceptual steps.
Advanced Split (30 problems): Each requires 6+ non-routine steps and often demands the combination of multiple deep ideas.

Both splits cover all four topic areas, with six to eight problems per area:

Topic Area	Basic	Advanced
Algebra	8	8
Combinatorics	8	8
Number Theory	8	6
Geometry	6	8

Natural language statements are directly formalized in Lean 4, relying solely on Mathlib’s elementary libraries to facilitate concise statements and proofs without the crutch of advanced imports. Representative examples include functional equations, root-counting results for polynomials, and challenging combinatorial identities—all encoded in Lean theorem format with sorry placeholders for proofs, supporting automated proof attempts (Kung et al., 2 Jun 2026).
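To make the statement format concrete, the following is a hypothetical illustration of a functional-equation statement written in this style. It is not taken from the benchmark and is far easier than an actual IMO-style problem; the theorem name and the choice of equation are assumptions made purely for illustration.

```lean
import Mathlib

/-- Hypothetical example, not a benchmark problem: any function on the reals
    satisfying `f (x + y) = x + f y` is a translation by `f 0`. -/
theorem illustrative_functional_equation (f : ℝ → ℝ)
    (h : ∀ x y : ℝ, f (x + y) = x + f y) :
    ∀ x : ℝ, f x = x + f 0 := by
  sorry
```

A prover under evaluation would be expected to replace the `sorry` with a proof that Lean accepts; here, specializing `h` to `y = 0` already suffices.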

3. Dataset Structure and Usage

Lean-IMO-Bench is distributed as a modular Lean 4 project:

basic/ contains the 30 basic problems as sequenced files (PBBasic001.lean ... PBBasic030.lean)
advanced/ holds the advanced set (PBAdvance001.lean ... PBAdvance030.lean)
Accompanying driver scripts (evaluate.py, run_eval.sh) automate problem selection, proof injection, proof attempt execution via candidate agents, and results aggregation.
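The layout above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's own code: the helper names `problem_files` and `inject_proof` are hypothetical, and the assumption that each file contains exactly one `sorry` to be replaced is a simplification.

```python
from pathlib import PurePosixPath

# Directory name -> file-name prefix, following the naming described above.
SPLITS = {"basic": "PBBasic", "advanced": "PBAdvance"}
PROBLEMS_PER_SPLIT = 30


def problem_files(split: str) -> list[str]:
    """Return the expected relative paths for one split (hypothetical helper)."""
    prefix = SPLITS[split]
    return [
        str(PurePosixPath(split) / f"{prefix}{i:03d}.lean")
        for i in range(1, PROBLEMS_PER_SPLIT + 1)
    ]


def inject_proof(statement_source: str, proof_body: str) -> str:
    """Replace the `sorry` placeholder with a candidate proof.

    Assumes exactly one `sorry` per file; the real scripts may differ.
    """
    if statement_source.count("sorry") != 1:
        raise ValueError("expected exactly one `sorry` placeholder")
    return statement_source.replace("sorry", proof_body)


if __name__ == "__main__":
    print(problem_files("basic")[0])      # basic/PBBasic001.lean
    print(problem_files("advanced")[-1])  # advanced/PBAdvance030.lean
```

Keeping path generation and proof injection as pure functions makes them easy to test without a Lean toolchain installed.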

The benchmark prescribes the following workflow:

Set up Lean 4 and Mathlib 4 using Lake.
Clone the repository and bootstrap dependencies (lake update).
Use the provided scripts to test models:
- One-shot evaluation: python evaluate.py --bench basic --model <model-endpoint> --mode one_shot --samples 128
- Iterative feedback (rollouts): python evaluate.py --bench advanced --model <model-endpoint> --mode iterative --rollouts 2
Output includes per-problem pass/fail status, time taken, and number of iterations, suitable for high-throughput benchmarking (Kung et al., 2 Jun 2026).
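For scripted sweeps, the two invocations above can be assembled programmatically. The sketch below only builds the argument list using the flags quoted in the workflow; the function name `build_eval_command` is hypothetical, and the behavior of `evaluate.py` itself is not modeled.

```python
def build_eval_command(bench: str, model: str, mode: str, budget: int) -> list[str]:
    """Assemble an evaluate.py command line from the flags shown above.

    `budget` is the sample count for one-shot mode or the rollout count
    for iterative mode (hypothetical helper, not part of the repository).
    """
    if bench not in ("basic", "advanced"):
        raise ValueError(f"unknown bench: {bench}")
    if mode == "one_shot":
        budget_flag = "--samples"
    elif mode == "iterative":
        budget_flag = "--rollouts"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [
        "python", "evaluate.py",
        "--bench", bench,
        "--model", model,
        "--mode", mode,
        budget_flag, str(budget),
    ]


if __name__ == "__main__":
    # The model endpoint is left as the placeholder used in the text.
    print(" ".join(build_eval_command("basic", "<model-endpoint>", "one_shot", 128)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root once Lean 4 and Mathlib are set up.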

4. Evaluation Methodology and Metrics

The primary metric is solve rate: the percentage of problems in a split (basic or advanced) for which a model produces a fully verified proof without manual intervention.
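As a small worked example of this metric, the sketch below computes a solve rate from per-problem outcomes. The outcome data are made up for illustration (25 of 30 problems marked solved) and the function name `solve_rate` is an assumption.

```python
def solve_rate(solved: dict[str, bool]) -> float:
    """Percentage of problems in a split with a fully verified proof."""
    if not solved:
        raise ValueError("no problems evaluated")
    return 100.0 * sum(solved.values()) / len(solved)


if __name__ == "__main__":
    # Illustrative outcomes only: the first 25 of 30 problems count as solved.
    outcomes = {f"PBBasic{i:03d}": i <= 25 for i in range(1, 31)}
    print(round(solve_rate(outcomes), 1))  # 83.3
```

With 30 problems per split, every reported rate corresponds to a whole number of solved problems out of 30.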

One-shot (pass@N): A problem counts as solved if at least one of N independent attempts yields a verified proof.
Iterative (rollouts): Allows up to R rounds of feedback, in each of which the agent receives compiler output and refines its proof attempt.
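The two modes can be contrasted with a short sketch. The callables `generate` and `verify` are hypothetical stand-ins for a model and the Lean checker, and the sketch assumes that R counts the total number of attempts, which the description above leaves open.

```python
from typing import Callable, Optional

# generate(feedback) -> candidate proof; verify(proof) -> (accepted, compiler output)
Generate = Callable[[Optional[str]], str]
Verify = Callable[[str], tuple[bool, str]]


def pass_at_n(generate: Generate, verify: Verify, n: int) -> bool:
    """One-shot mode: solved if any of n independent samples verifies."""
    return any(verify(generate(None))[0] for _ in range(n))


def iterative_attempt(generate: Generate, verify: Verify, rollouts: int) -> tuple[bool, int]:
    """Iterative mode: feed compiler output back into the next attempt.

    Returns (solved, number of rounds used). Assumes `rollouts` is the
    total number of attempts, including the first one without feedback.
    """
    feedback: Optional[str] = None
    for round_idx in range(1, rollouts + 1):
        proof = generate(feedback)
        accepted, compiler_output = verify(proof)
        if accepted:
            return True, round_idx
        feedback = compiler_output  # the next round sees the error message
    return False, rollouts


if __name__ == "__main__":
    # Toy stand-ins: the "model" only succeeds once it has seen an error.
    def toy_verify(proof: str) -> tuple[bool, str]:
        return proof == "good", "error: unsolved goals"

    def toy_generate(feedback: Optional[str]) -> str:
        return "bad" if feedback is None else "good"

    print(iterative_attempt(toy_generate, toy_verify, 2))  # (True, 2)
    print(pass_at_n(toy_generate, toy_verify, 4))          # False
```

The toy run shows the difference in kind: independent sampling never sees the error message, while the iterative loop can use it on the second round.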

Multiple baselines are benchmarked, including general-purpose LLMs (Gemini 3.1-Pro), open specialized provers (Goedel-Prover-V2 32B), hybrid agentic frameworks (Hilbert), and state-of-the-art closed-source solvers (Aristotle). Solve rates are reported for both the basic and advanced splits (Kung et al., 2 Jun 2026).

Method	Basic (%)	Advanced (%)
Gemini 3.1-Pro (Pass@128)	20.0	3.3
Goedel-V2-32B (Pass@128)	10.0	0.0
Hilbert (rollout=2)	36.6	6.6
Aristotle (rollout=2)	76.7	20.0
LEAP (rollout=2)	83.3	56.7

Notable results include LEAP achieving solve rates of 83.3% (basic) and 56.7% (advanced) with two rollouts, surpassing all prior baselines and exceeding the best non-LEAP baseline (Aristotle) by 6.6 percentage points (basic) and 36.7 percentage points (advanced). The systems evaluated one-shot at pass@128, without compiler feedback, reach at most 20.0% on the basic split and 3.3% on the advanced split, which indicates how difficult the problems are across the spectrum (Kung et al., 2 Jun 2026).
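The margins over the strongest baseline follow directly from the table. The sketch below recomputes them; the figures are copied from the table above, while the function name `margin_over_best_baseline` is hypothetical.

```python
# (basic %, advanced %) per method, copied from the results table.
RESULTS = {
    "Gemini 3.1-Pro (Pass@128)": (20.0, 3.3),
    "Goedel-V2-32B (Pass@128)": (10.0, 0.0),
    "Hilbert (rollout=2)": (36.6, 6.6),
    "Aristotle (rollout=2)": (76.7, 20.0),
    "LEAP (rollout=2)": (83.3, 56.7),
}


def margin_over_best_baseline(
    results: dict[str, tuple[float, float]], system: str
) -> tuple[float, ...]:
    """Percentage-point gap between `system` and the best other method, per split."""
    target = results[system]
    others = [scores for name, scores in results.items() if name != system]
    return tuple(
        round(target[i] - max(scores[i] for scores in others), 1)
        for i in range(len(target))
    )


if __name__ == "__main__":
    print(margin_over_best_baseline(RESULTS, "LEAP (rollout=2)"))  # (6.6, 36.7)
```

These gaps are differences between percentages, so they are percentage points rather than relative improvements.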

5. Comparative Position Among Benchmarks

CombiBench (Liu et al., 6 May 2025): Focuses exclusively on combinatorial problems, while Lean-IMO-Bench encompasses all four IMO domains. CombiBench established the fill-in-the-blank evaluation (Fine-Eval) and a Lean-native combinatorics corpus, but did not match the breadth or calibrated difficulty scaling of Lean-IMO-Bench.
LeanGeo-Bench (Song et al., 20 Aug 2025): Specializes in geometry via a custom DSL and SMT tactics, evaluating LLMs on 122 geometry problems consistent with IMO standards. Results demonstrate that even the strongest models plateau below 30% pass@4 on geometric problems, with no successful solves for actual IMO-level geometry, underscoring the challenging nature of Lean-IMO-Bench’s geometry subset.
IndiMathBench (Biyani et al., 30 Nov 2025): Curates 312 human-verified Lean 4 theorems from Indian mathematics olympiads, spanning all classical domains, and emphasizes human-AI collaborative autoformalization. Lean-IMO-Bench, by contrast, targets IMO and IMO-style problems with calibrated complexity, full automation of evaluation, and high granularity in difficulty splits.
Lean-IMO-Bench (decomposition edition) (Yousefzadeh et al., 2024): Targets miniF2F’s IMO+ set, supplementing it with full Lean proofs and 1,329 building-block lemmas for fine-grained LLM diagnosis. This variant offers rich lemma-level evaluation, aligning with Lean-IMO-Bench’s philosophy of non-trivial, multi-step proof obligations.

A plausible implication is that Lean-IMO-Bench, by offering broad topical coverage, clear difficulty gradation, and both one-shot and iterative evaluation modes, directly addresses the demonstrated performance and coverage limitations of prior Lean-based mathematical reasoning benchmarks.

6. Research Significance and Impact

Lean-IMO-Bench has served as the testbed for state-of-the-art results in automated formal mathematics. LEAP, an agentic system built on foundation models, reaches solve rates of 83.3% on the basic set and 56.7% on the advanced set with two rollouts, compared with 20.0% and 3.3% for a general-purpose LLM sampled at pass@128, and it surpasses even specialized gold-medal IMO systems. The dataset’s breadth, calibration, and reproducibility position it as a reference testbed for benchmarking emerging agentic architectures and foundation models on olympiad-level mathematics (Kung et al., 2 Jun 2026).

The provided scripts, data, and open-source infrastructure—available at https://imobench.github.io and https://github.com/imobench/Lean-IMO-Bench—lower the barrier for rapid experimentation and standardized evaluation, facilitating progress toward AI systems capable of formalizing and proving “real” competition mathematics.

7. Accessibility and Future Directions

Lean-IMO-Bench is publicly accessible, with all problems, evaluation tools, and instructional documentation available online. The dataset accommodates varying evaluation paradigms (pass@N, rollouts), allowing researchers to compare single-step LLMs, interactive proof agents, and hybrid architectures under unified conditions.

Open research avenues include:

Closing the persistent gap on geometry problems, which remain a bottleneck even for leading systems.
Extending autoformalization pipelines to further domains or to translation from informal language.
Integrating with or bridging to corpus-style lemma generation benchmarks for micro-level proof synthesis diagnostics.
Advancing agentic or reinforcement learning paradigms using Lean-IMO-Bench’s multi-step, error-driven feedback environment as a training substrate.

By capturing the essence of IMO-level non-routine reasoning in formalized, machine-verifiable form, Lean-IMO-Bench provides an indispensable measurement tool as the field moves toward AI systems capable of tackling the full spectrum of high-level mathematical competition content (Kung et al., 2 Jun 2026).