LEAN-GitHub: Formal Math Proof Dataset
- LEAN-GitHub is a large-scale dataset of human-written Lean 4 proofs, offering diverse, tactic-level formalizations from GitHub.
- An automated pipeline compiles 6,352 source files and extracts tactic-level data for 28,597 theorems comprising 218,866 tactic steps.
- Mixed with synthetic data during LLM fine-tuning, the dataset improves performance on neural theorem-proving benchmarks.
LEAN-GitHub is a large-scale dataset and extraction pipeline targeting human-written formal mathematics in Lean 4 collected from public GitHub repositories. It is designed to supply high-quality theorem-proving data for training and benchmarking LLMs and theorem-proving agents, addressing the well-documented scarcity of data for formal mathematical reasoning. By automating compilation and tactic extraction at scale, LEAN-GitHub enables state-of-the-art performance on problem corpora spanning high-school, undergraduate, and competition-level mathematics (Wu et al., 2024).
1. Scope and Objectives
LEAN-GitHub was developed to address the underutilization of human-written Lean 4 formalizations on GitHub, most of which were absent from prior open datasets. The core rationale is that existing training resources for neural theorem provers (e.g., Mathlib (Community, 2019), Lean Workbook (Ying et al., 2024), miniF2F) either lack diversity, omit tactic-level proof data, or are synthetic in origin. LEAN-GitHub compiles, parses, and aggregates proofs extracted from Lean 4 repositories, maximizing both repository and theorem coverage. The resulting corpus directly supports fine-tuning transformer-based theorem-proving models, whose performance is then benchmarked across diverse mathematical domains (Wu et al., 2024).
2. Data Collection, Preprocessing, and Extraction Pipeline
2.1 Repository Curation and Compilation
The dataset’s construction began with a programmatic search of GitHub for repositories labeled as "lean," yielding 237 candidates. Since GitHub’s language labels do not distinguish Lean 3 from Lean 4, keyword filtering for “theorem” and “lemma” was used to estimate relevant content (a sketch of this step follows the list below). The curation process surfaced several obstacles:
- Many repositories exhibited uncompiled code due to improper configuration of Lake (Lean 4’s project manager), missing dependencies, or outdated syntax.
- Of 147 valid Lean 4 repositories, only 61 built successfully without intervention; the remainder were patched either by dependency heuristics (matching project and Lean toolchain versions) or by treating standalone .lean files as isolated mini-projects.
- Deprecated syntax or non-tactic-based proofs led to the exclusion of 90 repositories.
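As a rough illustration of the search-and-filter step, the following minimal Python sketch queries GitHub's public search API and applies the keyword heuristic; the exact query, pagination bounds, and filtering logic are assumptions for illustration, not the authors' actual crawler.

```python
import requests

def search_lean_repos(max_pages: int = 3) -> list[dict]:
    """Hypothetical sketch: page through GitHub's repository search API."""
    repos: list[dict] = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={"q": "language:Lean", "per_page": 100, "page": page},
            headers={"Accept": "application/vnd.github+json"},
            timeout=30,
        )
        resp.raise_for_status()
        repos.extend(resp.json()["items"])
    return repos

def looks_relevant(lean_source: str) -> bool:
    # GitHub's language label does not separate Lean 3 from Lean 4, so the
    # paper falls back to keyword filtering on file contents.
    return "theorem" in lean_source or "lemma" in lean_source
```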
2.2 Parallel Compilation and Import Graph Augmentation
Rather than relying on Lake, whose all-or-nothing build semantics and concurrency bottlenecks were suboptimal for this task, the authors patched Lake to expose each project’s import directed acyclic graph (DAG). The DAG was augmented to include isolated .lean files, and many parallel invocations of leanc were issued directly, maximizing throughput and isolating independent build failures.
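A minimal sketch of such a DAG-driven build strategy is below; the `import_dag` structure, worker count, and direct `leanc` invocation are illustrative assumptions rather than the paper's actual harness, and the graph is assumed closed (every import appears as a key) and acyclic.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def compile_in_dag_order(import_dag: dict[str, list[str]]) -> dict[str, bool]:
    """Compile .lean files level by level along the import DAG (sketch)."""
    status: dict[str, bool] = {}
    pending = set(import_dag)

    with ThreadPoolExecutor(max_workers=32) as pool:
        while pending:
            # Ready = files whose imports have all been attempted already.
            ready = [f for f in pending
                     if all(d in status for d in import_dag[f])]
            if not ready:  # cycle guard; a well-formed import graph is acyclic
                break

            def build(path: str) -> bool:
                # A failed import fails its dependents without aborting the
                # rest of the DAG, recovering independent build failures.
                if not all(status[d] for d in import_dag[path]):
                    return False
                return subprocess.run(["leanc", path],
                                      capture_output=True).returncode == 0

            for path, ok in zip(ready, pool.map(build, ready)):
                status[path] = ok
            pending -= set(ready)
    return status
```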
Of 8,639 source files, 6,352 compiled successfully, containing roughly 42,000 theorems. Tactic-augmented, human-written proofs were extracted from 2,133 of these files, yielding the final corpus of 28,597 theorems and 218,866 individual tactic steps (approximately 0.131 billion tokens).
2.3 Tactic and State Extraction
Extraction of tactic-level supervision leverages and extends LeanDojo, but with improved parallelism and the ability to process isolated files. Each valid proof is decomposed into a sequence of tactic applications, and after every application, the intermediate goal state is captured. Hypotheses are renamed following kernel storage order, enabling effective de-duplication; otherwise, over 50% of such states are redundant.
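The hypothesis-renaming idea can be sketched as follows; real extraction happens inside Lean's metaprogramming API, so this string-level Python version (first-appearance order as a stand-in for kernel storage order, name collisions ignored) is only an approximation.

```python
import hashlib
import re

def normalize_goal_state(state: str) -> str:
    """Rename hypotheses to h0, h1, ... in order of first appearance."""
    renames: dict[str, str] = {}
    for line in state.splitlines():
        # Assume hypotheses print as "<name> : <type>" lines before "⊢".
        m = re.match(r"\s*([A-Za-z_][\w']*)\s*:", line)
        if m and m.group(1) not in renames:
            renames[m.group(1)] = f"h{len(renames)}"
    for old, new in renames.items():
        # Word-boundary substitution; collisions are ignored for brevity.
        state = re.sub(rf"\b{re.escape(old)}\b", new, state)
    return state

def state_key(state: str) -> str:
    """Stable key for de-duplicating goal states across a search tree."""
    return hashlib.sha256(normalize_goal_state(state).encode()).hexdigest()
```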
No explicit BNF grammar is provided for Lean 4 tactic scripts; all extraction is performed via the Lean metaprogramming API. Informally, each extracted proof step is a (goal state, tactic) pair, with the successor state recorded after every application.
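As an illustrative toy example (not drawn from the dataset), consider a short Lean 4 proof annotated with the intermediate states the pipeline would record after each tactic application:

```lean
-- Illustrative only: one (state, tactic) record per step.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  -- extracted pair: goal "a b : Nat ⊢ a + b = b + a",
  --                 tactic "rw [Nat.add_comm]"
  rw [Nat.add_comm]
  -- resulting state: no goals (proof complete)
```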
However, the pipeline does not formalize or parse any such surface grammar explicitly; states and tactics are recovered dynamically from Lean’s kernel (Wu et al., 2024).
2.4 Dataset Alignment and Comparison
A comparative overview with prior resources is as follows:
| Dataset | Theorems | Tokens | Intermediate States | Domain Level | Open-source |
|---|---|---|---|---|---|
| Lean-Workbook | 57,000 | 0.029 B | No | Undergraduate | Yes |
| Deepseek-Prover | 870,000 | 3.108 B | No | Undergraduate | No |
| miniF2F-curriculum | 327 | 1.5 K | No | High-school | Yes |
| LeanDojo-Mathlib | 60,000 | 0.138 B | Yes | Diverse | Yes |
| LEAN-GitHub | 28,597 | 0.131 B | Yes | Diverse | Yes |
The average proof length follows directly from the extraction counts: 218,866 tactic steps across 28,597 theorems gives ≈7.7 tactics per theorem. The paper does not report lines-of-code statistics.
3. Model Training and Architectural Details
The principal usage for LEAN-GitHub is in fine-tuning LLMs for tactic prediction and lemma synthesis. The main model, InternLM-Math-Plus (7B), is a decoder-only transformer continually pre-trained on 200B tokens of mathematical text (formal and informal).
3.1 Training Protocol
Each example in training is structured as:
- DECL <theorem_name>
- GOAL <current_goal_state>
- PROOFSTEP <next_tactic>
with the loss being the standard autoregressive negative log-likelihood over the PROOFSTEP tokens, conditioned on the declaration and goal:

$$
\mathcal{L}(\theta) \;=\; -\sum_{(d,\,g,\,t)\in\mathcal{D}} \sum_{k=1}^{|t|} \log p_\theta\!\left(t_k \mid d,\; g,\; t_{<k}\right)
$$

where $d$ is the theorem declaration, $g$ the current goal state, and $t$ the tokenized next tactic.
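A single serialized training instance in this format, reusing the toy proof from §2.3, might look like the following; the exact delimiters and whitespace are assumptions:

```text
DECL add_comm_example
GOAL a b : Nat ⊢ a + b = b + a
PROOFSTEP rw [Nat.add_comm]
```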
Hyperparameters include a global batch size of 512, linear learning-rate warm-up over the first 3% of updates followed by cosine decay, and two training epochs over the mixed dataset (human and synthetic, totaling ≈6 hours on 32×A100 GPUs).
3.2 Training Data Composition and Ablation
- Human-written: LEAN-GitHub (0.131B tokens), Mathlib via LeanDojo (0.138B tokens)
- Synthetic: Rule-based synthetic equation and inequality proofs from Lean Workbook (≈1.143B tokens) (Ying et al., 2024)
Ablation regimes systematically mix and compare the effect of these data sources:
- Mathlib only
- Mathlib + LEAN-GitHub
- Mathlib + synthetic
- Mathlib + LEAN-GitHub + synthetic
4. Benchmarking and Performance Assessment
4.1 Proof Search Methodology
Proof search employs best-first tree search: the model proposes a fixed budget of candidate tactics per goal state, and a bounded number of states are expanded per iteration. Intermediate goals are systematically de-duplicated through hypothesis normalization, so equivalent states are never expanded twice.
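A compact sketch of such a search loop appears below, reusing `normalize_goal_state` from the §2.3 sketch for de-duplication; `propose_tactics` and `apply_tactic` stand in for the fine-tuned model and the Lean interaction layer, and all budgets are illustrative defaults rather than the paper's settings.

```python
import heapq

def best_first_search(root_state, propose_tactics, apply_tactic,
                      max_expansions=600, n_candidates=32):
    """Best-first proof search with goal-state de-duplication (sketch)."""
    # Frontier is a min-heap keyed on negated cumulative log-probability,
    # so the most promising partial proof is expanded first.
    frontier = [(0.0, root_state, [])]
    seen = {normalize_goal_state(root_state)}
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_lp, state, proof = heapq.heappop(frontier)
        # Assume candidates arrive sorted by model log-probability.
        for tactic, lp in propose_tactics(state)[:n_candidates]:
            next_state = apply_tactic(state, tactic)
            if next_state is None:           # tactic failed to apply
                continue
            if next_state == "no goals":     # proof complete
                return proof + [tactic]
            key = normalize_goal_state(next_state)
            if key in seen:                  # duplicate state: skip expansion
                continue
            seen.add(key)
            heapq.heappush(frontier,
                           (neg_lp - lp, next_state, proof + [tactic]))
    return None                              # search budget exhausted
```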
4.2 Benchmarks and Results
Key results are summarized below:
miniF2F (244 valid, 244 test; high-school to undergraduate):
| Method | Model Size | Pass@1 | Pass@64 |
|---|---|---|---|
| GPT-4-turbo (whole proof) | – | 23.0% | 25.4% |
| DeepSeek-Prover | 7B | 30.0% | 46.3% |
| Lean-STaR | 7B | – | 46.3% |
| InternLM-Math-Plus | 7B | 43.4% | – |
| Prover (LEAN-GitHub) | 7B | 48.8% | 54.5% |
ProofNet (371 undergraduate problems):
| Method | Model Size | Pass@1 |
|---|---|---|
| ReProver | 0.23B | 13.8% |
| Prover (LEAN-GitHub) | 7B | 18.1% |
PutnamBench (640 competition-level problems) is also evaluated. No formal significance testing or confidence intervals are reported for any benchmark.
Ablation experiments show that the addition of LEAN-GitHub data consistently boosts performance, especially on diverse or challenging problem domains. The synthetic data alone, while ample, does not confer equivalent improvements on difficult or diverse tasks (Wu et al., 2024).
5. Impact, Limitations, and Future Directions
LEAN-GitHub demonstrates that systematically extracted, human-written Lean 4 proofs expand both the breadth and novelty of formal proof patterns available to LLM-driven theorem provers. Key findings include:
- De-duplication of intermediate proof states removes the roughly 50% of states that are redundant, cutting wasted search effort and measurably increasing proof rates at higher pass budgets.
- Training on synthetic data alone is less effective than supplementing it with actual GitHub-sourced human proofs, particularly for complex or nonstandard domains.
- Coverage gaps remain: ≈90 repositories with valid Lean 4 code were excluded owing to version mismatches or missing dependencies, suggesting the current dataset is not fully exhaustive. Metrics such as lines of code per proof and average tactic depth are noted as future work.
- Future extensions proposed by the authors include expanding the extraction pipeline to Lean 3, Coq, and Isabelle; stronger curriculum strategies (e.g., layering by topic difficulty); and leveraging informal-to-formal prompting to further scale usable corpora (Wu et al., 2024).
6. Data and Implementation Resources
LEAN-GitHub is distributed via Hugging Face (https://huggingface.co/datasets/InternLM/Lean-GitHub) in JSONL format with standard splits of train (85%), valid (5%), and test (10%), plus a metadata file. The associated fine-tuned model (InternLM-Math-Plus) and proof-search code are open-sourced at https://github.com/InternLM/InternLM-Math, with weights, extraction scripts, and evaluation harnesses in clearly delineated directories.
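For orientation, the dataset can presumably be pulled with the Hugging Face `datasets` library; the snippet below is a hedged sketch, and the record schema mentioned in the comment is an assumption rather than a documented guarantee.

```python
from datasets import load_dataset

# Hypothetical loading sketch; the repository id matches the URL above.
ds = load_dataset("InternLM/Lean-GitHub", split="train")
print(ds[0])  # expect fields such as theorem name, goal state, and tactic
```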
By open-sourcing both data and prover code, LEAN-GitHub and its associated resources enable the community to leverage a previously under-tapped corpus of human-crafted formal mathematics, thereby advancing empirical research and practical capabilities in neural formal reasoning (Wu et al., 2024).