NuminaMath-LEAN: Modular Lean Datasets for AI Proofs

Updated 14 December 2025
  • NuminaMath-LEAN is a modular initiative that organizes Lean formal proofs for challenging mathematics by decomposing complex IMO problems into systematically verified lemmas.
  • It provides a detailed dataset including hand-crafted Lean proofs, automated lemma extraction methods, and comprehensive metadata for rigorous AI benchmarking.
  • The project integrates evaluation pipelines and LLM-guided tactic prediction tools to advance research in automated theorem proving and formal mathematics.

NuminaMath-LEAN refers to a large-scale, modular initiative for constructing, organizing, and evaluating datasets and toolchains at the interface of Lean formal proof and machine learning, with an emphasis on building resources sufficient to train, benchmark, and explain AI models on hard mathematical problem solving. It encompasses both hand-crafted Lean formalizations of challenging mathematics (notably International Mathematical Olympiad (IMO) problems) and the development of procedures (pipelines, decomposition strategies, evaluation metrics) enabling automatic translation, lemma extraction, and cross-system integration for research in mechanized mathematics and formal theorem proving.

1. Origin and Scope

The NuminaMath-LEAN concept is rooted in efforts to systematically lower the barriers to machine formalization of challenging mathematics, addressing both the data bottleneck and the need for granular diagnostic evaluation. The project emerged from recent work to close high-profile gaps in public Lean datasets for hard olympiad problems, most notably the introduction of a modular, lemma-oriented dataset derived from full formal proofs of previously unformalized IMO problems (Yousefzadeh et al., 28 Nov 2024). NuminaMath-LEAN aggregates resources along several axes:

  • Hand-written, fully verified Lean 3/4 proofs for olympiad-level math;
  • Principled decomposition of these proofs into hundreds of nontrivial, "atomized" lemmas, providing mid-level challenges;
  • Synthesis of evaluation pipelines that allow AI-generated proofs (from LLMs, tactic prediction models, hybrid search systems) to be benchmarked in Lean;
  • Modular organization compatible with collaborative development, codebase extension, and cross-formalization (e.g., with Coq, Isabelle);
  • A long-term roadmap toward robust, retrieval-augmented and tactic-guided learning environments for formal math.

2. Dataset Construction and Decomposition Methodology

A distinguishing feature of NuminaMath-LEAN is its rigorous decomposition pipeline. In (Yousefzadeh et al., 28 Nov 2024), Yousefzadeh & Cao present a procedure for slicing complex olympiad proofs into smaller "building block" lemmas that resist automatic solution. The specific process is as follows:

  • Select IMO problems for which no prior public Lean formalization existed—12 problems ranging from 1959 to 2023.
  • Write full, original Lean proofs for each problem (roughly 3,100 lines in a single Lean variant, and about 5,150 lines when both the Lean 3 and Lean 4 versions are counted).
  • Decompose these proofs by meticulously extracting all nontrivial intermediate claims:
    • Each lemma requires at least two tactic lines;
    • Lemmas must be out of reach of Lean's default solvers (e.g., not instantly dispatched by simp, linarith, hint, etc.);
    • Both the direct claims and variants with extra granted hypotheses are generated, to cover plausible subgoal configurations in actual proof search.
  • The result: a library of 907 systematically named lemmas (e.g., imo_1959_p1_l1) covering ∼25,480 lines of Lean 4 code across the 12 problems, each lemma Lean-verified and accompanied by full metadata (problem, year, code lines, tags).

This lemma dataset is designed to be highly modular and enables precise, failure-mode-resolved evaluation of AI provers.
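
To make the format concrete, below is a minimal, hypothetical sketch of what an entry in the decomposed library might look like, following the imo_YEAR_PROBLEM_lN naming convention. The statement is an illustrative assumption built around IMO 1959 Problem 1 (show that gcd(21n + 4, 14n + 3) = 1) and is not copied from the repository; the dataset's actual lemmas are required to need at least two tactic lines and to resist one-shot automation.

```lean
import Mathlib

-- Hypothetical illustration only: the name follows the dataset's
-- imo_YEAR_PROBLEM_lN convention, but the statement and proof are
-- assumptions, not taken from the actual lemma files.
-- IMO 1959 P1 asks for gcd (21 * n + 4) (14 * n + 3) = 1; one natural
-- "building block" is the linear combination witnessing coprimality:
theorem imo_1959_p1_l1 (n : ℕ) :
    3 * (14 * n + 3) = 2 * (21 * n + 4) + 1 := by
  ring  -- the dataset's real lemmas are hard enough to defeat such one-liners
```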

3. Dataset Organization, Metadata, and Public Access

The organization of NuminaMath-LEAN resources emphasizes reproducibility, transparency, and downstream automation. Each problem is anchored in a dedicated directory, with:

  • The full Lean source of the main proof (e.g., 1959_p1.lean);
  • An extracted lemmas file (e.g., 1959_p1_lemmas.lean) listing and proving all decomposed lemmas;
  • Systematic lemma naming conventions (imo_YEAR_PROBLEM_lN) aligning code, metadata, and documentation;
  • A metadata.json enumerating, for each file, the number of lemmas, line counts, tags, miniF2F membership, IMO year, and topics;
  • A datasheet.md following the datasheets-for-datasets template of Gebru et al., recording composition steps, provenance, and licensing information (MIT license);
  • All code is checked by Lean 4 in batch builds, guaranteeing proof soundness.

The full dataset is available at https://github.com/roozbeh-yz/IMO-Steps.
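
As an illustration, a downstream tool might consume this metadata as in the following minimal sketch; the field names (num_lemmas, lines, tags, minif2f, year) are assumptions inferred from the description above, not the repository's actual schema.

```python
# Minimal sketch of reading per-problem metadata; the field names are
# assumptions, not the actual schema of the repository's metadata.json.
import json
from pathlib import Path

meta = json.loads(Path("1959_p1/metadata.json").read_text())
print(f"{meta['num_lemmas']} lemmas across {meta['lines']} lines of Lean 4")
print("tags:", meta["tags"], "| miniF2F:", meta["minif2f"], "| year:", meta["year"])
```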

4. Evaluation with LLMs

The NuminaMath-LEAN pipeline incorporates zero-shot and chain-of-thought experiments that reveal the precise strengths and failure modes of state-of-the-art LLMs, specifically GPT-4 (Yousefzadeh et al., 28 Nov 2024). The findings include:

  • For pre-1980s problems (the first six in the set), GPT-4 produces correct informal proofs for ∼30–75% of lemmas and valid Lean tactic proofs ∼15–60% of the time. Adding lemma retrieval and expert feedback nudges these scores up by a further 5–10 percentage points.
  • For problems dated 1984 and later, GPT-4's Lean proof accuracy is effectively zero, with frequent hallucinations of nonexistent lemmas or tactic names and foundational failures in approach and application.
  • Diagnosis of common LLM failure patterns:
    • Hallucination of undefined theorems or tactics (e.g., "nat.dvd_of_factorization_eq");
    • Misapplication of Mathlib lemmas;
    • Syntactic issues, such as argument order or bracket omission;
    • Misguided problem-solving strategies.
  • Roughly 20% of GPT-4's successful proofs are copied verbatim from public Mathlib; 67% are minor edits of existing code, and only ∼13% represent genuinely nontrivial generalization.
  • These outcomes justify a mid-level (lemma-centric) curriculum for future training and evaluation, pointing toward supervised and retrieval-augmented LLM architectures.
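
Schematically, the zero-shot Lean evaluation described above reduces to generating a candidate proof and type-checking it with the Lean toolchain. The sketch below is an illustrative harness, not the authors' evaluation code: ask_llm is a placeholder for the model call, and verification shells out to lake env lean inside a Mathlib-enabled project.

```python
# Schematic zero-shot evaluation loop: ask a model for a Lean proof of a
# lemma statement, then verify it by type-checking with Lean 4.
# `ask_llm` is a placeholder (an assumption), not the paper's harness.
import subprocess
import tempfile

def ask_llm(statement: str) -> str:
    """Placeholder: return a complete Lean theorem with proof for `statement`."""
    raise NotImplementedError  # plug in GPT-4 or another model here

def lean_accepts(source: str) -> bool:
    """Type-check candidate Lean code from within a Mathlib-enabled Lake project."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(source)
        path = f.name
    proc = subprocess.run(["lake", "env", "lean", path], capture_output=True)
    return proc.returncode == 0  # nonzero exit code means Lean rejected the proof

# Example usage, once ask_llm is implemented:
#   statement = "theorem imo_1959_p1_l1 (n : ℕ) : ..."  # from a lemmas file
#   candidate = ask_llm(statement)
#   print("verified" if lean_accepts("import Mathlib\n\n" + candidate) else "failed")
```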

5. Interface to Automated Theorem Proving, Retrieval, and Planning

The modular structure of the lemma library enables direct integration with both retrieval-based and tactic-prediction models. The authors envision (and partly implement):

  • Use of the lemma set as an intermediate-scale benchmark for AI-guided Lean proof search, with each lemma representing a single, granular challenge out of reach of basic automation;
  • Deployment in retrieval-augmented LLMs, hybrid search systems (e.g., Copilot for Lean), and joint search/tactic models;
  • Iterative composition: a proposed “full-proof” pipeline operates by constructing a high-level plan sketch, retrieving relevant lemmas, executing candidate tactic sequences, and using backtracking and context refinement to converge on a solution (a schematic sketch follows this list);
  • Expansion of the resource to cover all 40 miniF2F IMO problems by systematically applying the decomposition methodology, and porting to alternative proof assistants (Isabelle, Coq) for cross-environment benchmarking;
  • Foundations for robust proof-planning tools; this direction remains open, as the current resource does not yet formalize metaplanning or lemma selection and sequencing.
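
The iterative composition loop admits a compact schematic. The sketch below is an assumption about how such a pipeline could be wired together, not the authors' implementation: plan_sketch, retrieve_lemmas, try_tactics, and refine_context are hypothetical stand-ins for a planner model, a retriever over the 907-lemma library, a Lean tactic-execution layer, and a context editor.

```python
# Schematic of the proposed plan / retrieve / execute / backtrack loop.
# All four helpers are hypothetical stand-ins, not an existing API.
from typing import Optional

def plan_sketch(goal: str) -> list[str]: ...          # high-level subgoal plan
def retrieve_lemmas(subgoal: str) -> list[str]: ...   # nearest library lemmas
def try_tactics(subgoal: str, lemmas: list[str]) -> Optional[list[str]]: ...
def refine_context(subgoal: str) -> str: ...          # rewrite context before retry

def prove(goal: str, max_retries: int = 5) -> Optional[list[str]]:
    proof: list[str] = []
    for subgoal in plan_sketch(goal):            # 1. construct a high-level plan
        for _ in range(max_retries):
            lemmas = retrieve_lemmas(subgoal)    # 2. retrieve relevant lemmas
            steps = try_tactics(subgoal, lemmas) # 3. execute candidate tactics
            if steps is not None:
                proof.extend(steps)
                break
            subgoal = refine_context(subgoal)    # 4. refine context and retry
        else:
            return None  # backtrack: no tactic sequence closed this subgoal
    return proof
```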

6. Strengths, Limitations, and Roadmap

NuminaMath-LEAN as currently constituted has several notable strengths and known limitations, summarized in the table below:

| Category | Covered | Not Covered / Limitations |
|---|---|---|
| Domains | Number theory, algebra | Geometry, combinatorics (decomposition pending) |
| Proof types | Non-trivial, Lean-hard lemmas | Trivial goals solved by native Lean tactics |
| Pipeline | Lean 3/4 formalizations, decomposition, code | Full proof planning, geometry tactics, deep pipeline integration pending |
| Usage | GPT-4 and LLM evaluation, human study | Automated, end-to-end AI theorem proving |

The resource's next-phase roadmap includes:

  • Integration of the lemma library as an organized, public "IMO lemmas" subfolder in Mathlib;
  • Fine-tuning and evaluation of code LLMs (e.g., Code Llama) on the 25k-line corpus;
  • Development of proof-planning pipelines and expansion of coverage to combinatorics/geometry via SMT-enhanced decomposition (drawing on precedents such as AlphaGeometry);
  • Maintenance and adaptation to ongoing developments in the Lean ecosystem (notably Lean 5 and evolving Mathlib APIs).

7. Impact and Connection to the Broader Formalization Landscape

NuminaMath-LEAN is a strategic nucleus for AI-for-maths research aiming to close the gap between pure formalization and LLM-driven math reasoning. Its lemma-centric curriculum directly addresses the granularity mismatch between full-problem datasets (too hard, yielding few positive AI completions) and synthetic data (often too simple or uninformative). The project's methodology has informed iterative data-generation and translation strategies as seen in the Lean Workbook project (Ying et al., 6 Jun 2024), which employs active learning, NLI filtering, and synthetic-to-formal translation to build a larger but less carefully curated dataset.

By establishing a blueprint for future resource scaling—across domains, assistants, and levels—NuminaMath-LEAN organizes proof data, decomposition, metadata, and evaluation in a principled, extensible manner, facilitating reproducible research and empirical progress in mechanized mathematical reasoning.
