Lean Workbook: Autoformalized Math Problems
- Lean Workbook is a curated resource of 57,231 human-verified NL–Lean pairs that bridges informal math problems and formal Lean 4 statements.
- It is built with a robust autoformalization pipeline combining seed data, model fine-tuning, and human-in-the-loop iteration, reaching a 93.5% weighted-average translation accuracy.
- The extended Lean-Workbook-Plus dataset (82,275 statements) serves as a large-scale training and evaluation corpus for LLM-based theorem provers, with quality tracked via compile-pass and NLI-pass metrics.
The Lean Workbook is a large-scale, auto-formalized problem set of mathematical statements curated for automated theorem proving in the Lean 4 formal language. Serving as a foundational training and evaluation resource for LLMs in formal mathematics, the Lean Workbook strategically addresses the scarcity of contest-style, formally encoded math problems by synthesizing human-verified Lean statements and natural language (NL) pairs at unprecedented scale. This resource underpins both data creation pipelines (autoformalization) and algorithmic advances in expert iteration for LLM-based theorem provers, as demonstrated in the InternLM-Math-Plus and InternLM2.5-StepProver lines of research (Ying et al., 2024, Wu et al., 2024).
1. Motivation and Data Landscape
Automated theorem proving requires bridging the gap between informal mathematical problem statements and their formal counterparts in languages such as Lean 4. While LLMs have shown advanced capabilities in mathematical reasoning across informal math benchmarks (e.g., GSM8K, MATH), their performance in fully formalized settings remains constrained by the limited availability of large, diverse, and high-quality Lean-annotated datasets. Manually encoding contest- or Olympiad-style problems into Lean is labor intensive. Existing collections such as MiniF2F and ProofNet cover only a few thousand examples and focus predominantly on foundational theorems, not the broad spectrum encountered in contests and advanced undergraduate curricula. The Lean Workbook directly addresses this data bottleneck by automating the translation, verification, and curation of formal-informal statement pairs at scale (Ying et al., 2024).
2. Autoformalization Pipeline and Dataset Creation
The Lean Workbook dataset is constructed through a multi-stage, active-learning pipeline designed to maximize both scale and formal correctness:
- Seed Stage: All available Lean 4 statements (with placeholder proofs via `:= by sorry`) and their associated English descriptions were harvested from MiniF2F and ProofNet, augmented with additional tasks such as tactic prediction.
- Model Fine-Tuning: The InternLM-Math-Plus-20B model, pretrained on Lean content, underwent supervised fine-tuning on the seed pairs in both translation directions (NL→Lean and Lean→NL), running on 32 NVIDIA A100 GPUs for 3 epochs.
- Active Learning from AoPS: Approximately 1.1 million posts from the Art of Problem Solving forum were classified for well-definedness, out of which 458,692 were tagged as valid math problems. Topic tags included inequality, number theory, algebra, etc. LLM-based filtering and translation produced 327,870 candidate Lean 4 statements.
- Automated Filtering: Each Lean statement was checked for syntactic and semantic validity: first by compilation in Lean 4 against Mathlib4 (Compile-Pass, CPN), then by back-translation and natural language inference (NLI-Pass, NPN) to ensure semantic fidelity to the original NL statement. A sketch of this two-stage filter follows the list.
- Human-in-the-Loop Iteration: Statements failing either compilation or NLI checks were manually corrected and injected back into the training pool. Six rounds of such iterative refinement resulted in a final dataset of 57,231 NL–Lean pairs, with 341 supervised corrections in total and estimated weighted average accuracy of 93.5% across tags.
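The two-stage filter can be made concrete in code. Below is a minimal Python sketch, not the project's released implementation: `back_translate` and `nli_entails` are hypothetical stand-ins for calls to the fine-tuned translation model and an NLI checker, and the Lake invocation assumes a local project with Mathlib4 as a dependency.

```python
import pathlib
import subprocess

def compile_pass(lean_stmt: str, project_dir: str) -> bool:
    """Compile-Pass (CPN) check: the statement, carrying a `sorry` proof,
    must elaborate against Mathlib4. Assumes `project_dir` is a Lake
    project with Mathlib as a dependency."""
    src = "import Mathlib\n\n" + lean_stmt
    path = pathlib.Path(project_dir) / "Candidate.lean"
    path.write_text(src, encoding="utf-8")
    try:
        result = subprocess.run(
            ["lake", "env", "lean", str(path)],
            cwd=project_dir, capture_output=True, timeout=300,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def nli_pass(lean_stmt: str, nl_problem: str, back_translate, nli_entails) -> bool:
    """NLI-Pass (NPN) check: back-translate the Lean statement to English
    and require entailment in both directions against the original."""
    paraphrase = back_translate(lean_stmt)  # Lean -> NL direction of the model
    return nli_entails(nl_problem, paraphrase) and nli_entails(paraphrase, nl_problem)

def filter_candidates(candidates, project_dir, back_translate, nli_entails):
    """Keep only (NL, Lean) pairs passing both checks; failures are routed
    to the human-correction pool for the next iteration round."""
    kept = []
    for nl_problem, lean_stmt in candidates:
        if not compile_pass(lean_stmt, project_dir):
            continue  # fails Compile-Pass; does not count toward CPN
        if nli_pass(lean_stmt, nl_problem, back_translate, nli_entails):
            kept.append((nl_problem, lean_stmt))  # counted toward NPN
    return kept
```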
Canonical Lean statement examples synthesized by the pipeline include:
```lean
theorem ex_1 (n p : ℕ) (hp : Nat.Prime p) (hd : p ∣ n) :
    { (x, y) : ℕ × ℕ | x + y = n ∧ Nat.gcd x y = p }.Finite := by sorry
```

```lean
theorem lem1 (a b c d : ℝ) (ha : 0 < a) … (habc : a*b*c*d = 1) :
    (1/(1+(1+a)^2) + … + 1/(1+(1+d)^2)) ≤ (4:ℝ)/5 := by sorry
```
3. Dataset Structure, Coverage, and Extensions
The final Lean Workbook resource encompasses 57,231 verified NL–Lean pairs. Each entry comprises a formal Lean 4 theorem or lemma declaration and an English translation; approximately 5,000 pairs also carry formal proofs found by InternLM-Math-Plus proof search (Pass@1024 = 8.6%, i.e., 4,898 solved statements). Additionally, 21 new IMO (International Mathematical Olympiad) problems, formalized and verified through this pipeline, extend the reach of the dataset.
The extended Lean-Workbook-Plus dataset, as referenced in the subsequent expert-iteration work, comprises 82,275 declarations spanning domains such as elementary number theory, algebra, real/complex analysis, topology, combinatorics, and undergraduate-level mathematics (Wu et al., 2024).
4. Model Training, Evaluation, and Benchmarking
Key evaluation metrics include:
- Compile-Pass Number (CPN): The number of Lean statements compiling successfully.
- NLI-Pass Number (NPN): Statements that both compile and pass NLI validation.
- Human Correct Translation Rate: Manually sampled accuracy by domain experts.
Metric evolution for the core 327,870 NL problems:
| Model Stage | CPN | NPN |
|---|---|---|
| First-round (seed only) | 136,670 | 37,122 |
| Final, +341 human corrections | 205,079 | 57,231 |
| Final + full Lean Workbook | 228,928 | 82,893 |
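On the core pool of 327,870 problems, these counts correspond to a final compile-pass rate of 205,079 / 327,870 ≈ 62.6% and an NLI-pass rate of 57,231 / 327,870 ≈ 17.5%; with the full Lean Workbook in training, the NLI-pass rate rises to 82,893 / 327,870 ≈ 25.3%.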
Per-tag sampled formalization accuracy, e.g., inequality (10/10), algebra (9/10), and number theory (9/10), yields a weighted average of 93.5% (Ying et al., 2024).
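A plausible reading of "weighted average" here is that each tag's sampled accuracy is weighted by its frequency (the notation is ours): acc_weighted = Σ_t n_t · acc_t / Σ_t n_t, where n_t is the number of statements carrying tag t and acc_t the sampled accuracy for that tag.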
Downstream, the InternLM2.5-StepProver framework leverages Lean-Workbook-Plus for large-scale expert iteration, yielding substantial advances:
- MiniF2F-test (Lean-4 split, pass@256, BF+CG): 65.9%, compared to the previous best open-source result of 60.2%.
- Lean-Workbook-Plus: 13.1% of problems proved (10,880); 3.9% disproved (3,195); total 17.0% either proved or disproved (14,075).
- ProofNet: 27.0% pass, outperforming the prior best open result.
- Putnam Benchmark: 6/640 solved, exceeding previous open systems without informal proof skeletons (Wu et al., 2024).
5. Expert Iteration, Critic-Guided Search, and Empirical Scaling Laws
InternLM2.5-StepProver implements an iterative refinement process to optimize LLM-based theorem proving:
- Rapid Scan: Up to 10 best-first-search (BF) iterations per statement or 50 seconds wall-time; solved/disproved statements are pruned.
- Self-training: Aggregates newly found proof trajectories (goal states + tactics) for supervised fine-tuning (7B SFT model, ≈2.19B tokens per full sweep).
- Budget Increase: Allows up to 2,000 search iterations and 3,600 seconds per problem in later rounds, replacing long or ill-formed proofs as needed.
- Critic-Filtered Search: A distinct critic model (based on InternLM2-chat-1_8b-sft, 1.8B) scores proof states by estimated distance to `no_goals`, filtering and prioritizing search on the most promising states (top 50%); a sketch follows this list. Critic-guided (CG) runs alternate with policy log-prob-based selection, systematically uncovering longer and deeper proofs (average proof length 4.44 vs. 1.66 for BF alone) and boosting total problem coverage.
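As a concrete illustration of the critic-filtered loop, here is a hypothetical Python sketch; `expand`, `critic_score`, and the proof-state API stand in for the policy model, the critic, and the Lean interaction layer, and do not reproduce the released implementation.

```python
import heapq
import itertools

def critic_guided_search(root, expand, critic_score, max_iters=2000):
    """Best-first proof search with critic filtering.

    critic_score(state) estimates the remaining distance to `no_goals`
    (lower is better); expand(state) asks the policy model for tactics
    and returns successor proof states. Only the top-50% of each
    expansion is pushed onto the frontier.
    """
    tie = itertools.count()  # tie-breaker so heapq never compares states
    frontier = [(critic_score(root), next(tie), root)]
    for _ in range(max_iters):
        if not frontier:
            return None  # frontier exhausted without a proof
        _, _, state = heapq.heappop(frontier)
        if state.is_proved():  # Lean reports no remaining goals
            return state.extract_proof()
        children = sorted(expand(state), key=critic_score)
        for child in children[: max(1, len(children) // 2)]:  # top-50% filter
            heapq.heappush(frontier, (critic_score(child), next(tie), child))
    return None  # iteration budget exhausted
```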
Empirically, two log-linear scaling laws emerged (written out after the list):
- CPU time vs. problem count: The logarithm of the number of problems solved grows linearly with the logarithm of the average CPU time spent per problem, separating "trivial" problems from a "hard tail."
- Proof length vs. CPU time: Average CPU time per proof increases exponentially with minimal proof line count.
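Written out, the two fitted trends take the following form (the notation and coefficients $a, b, c, d$ are ours, summarizing the reported relationships rather than reproducing the paper's exact fits):

$$\log N_{\mathrm{solved}} \approx a + b\,\log \bar{T}_{\mathrm{CPU}}, \qquad \log \bar{T}_{\mathrm{CPU}} \approx c + d\,L_{\mathrm{proof}},$$

where $N_{\mathrm{solved}}$ is the number of problems solved, $\bar{T}_{\mathrm{CPU}}$ the average CPU time per problem, and $L_{\mathrm{proof}}$ the minimal proof length in lines.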
All successful proofs were found within only ≈1.5% of the total CPU-hours, with the remaining ≈98.5% expended on statements that stayed unproven, indicating steep resource demands as problem complexity increases (Wu et al., 2024).
6. Open-Source Resources, Limitations, and Future Directions
The codebase and full dataset are openly available (Apache 2.0):
- Code: https://github.com/InternLM/InternLM-Math
- Dataset: https://huggingface.co/datasets/InternLM/Lean-Workbook (DOI: 10.57967/hf/2399)
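For quick inspection, the dataset can be pulled from the Hub with the `datasets` library; this is a minimal sketch, and the field names shown are assumptions to be checked against the actual schema of the release.

```python
from datasets import load_dataset  # pip install datasets

# Pull the public release from the Hugging Face Hub.
ds = load_dataset("InternLM/Lean-Workbook", split="train")
print(len(ds))  # number of NL-Lean pairs

example = ds[0]
print(example.keys())  # inspect the real schema; field names below are guesses
print(example.get("natural_language_statement"))
print(example.get("formal_statement"))  # Lean 4 declaration ending in `sorry`
```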
Current limitations include non-trivial deduplication of near-duplicate or paraphrased problem statements, limited generalizability to undergraduate- or research-level problems beyond the contest genre, and under-representation of geometry due to noisy extraction of AoPS tags. Addressing these constraints will require broader coverage of mathematical domains, tighter integration of proof search into the autoformalization loop, more faithful handling of subtle semantics such as divisibility and minimality, and research into unsupervised validation strategies beyond NLI.
A plausible implication is that future data-centric advances, when paired with critic-guided expert iteration and scalable proof search, will facilitate further leaps in automated theorem proving, both in breadth of mathematical coverage and depth of proof synthesis (Ying et al., 2024, Wu et al., 2024).