LeanDojo Benchmark 4

Updated 24 January 2026
  • LeanDojo Benchmark 4 is a standardized evaluation suite that tests automated theorem proving and autoformalization using novel premise splits and curated Lean4 tasks.
  • It employs rigorous protocols and metrics, such as success rate and proof-checking, to assess the performance of LLMs in handling out-of-distribution reasoning.
  • State-of-the-art systems leverage symbolic mutations and retrieval-augmented models to enhance compositionality and generalization in formal mathematics.

LeanDojo Benchmark 4 is a standardized evaluation suite within the LeanDojo family designed to assess automated theorem proving and autoformalization systems in Lean, with a particular focus on LLMs and retrieval-augmented architectures. The name covers two closely related usages in the literature: (1) the "novel_premises" split for formal Lean proving (as in Alchemy and the original LeanDojo benchmarks) and (2) a collection of autoformalization tasks in Lean4 that test translation from informal to formal mathematics. Both reflect ongoing challenges in out-of-distribution reasoning and compositionality in formal mathematics, with formal grounding in mathlib (Lean’s mathematical library) and robust evaluation metrics tailored to the Lean ecosystem (Wu et al., 2024, Yang et al., 2023, Petrovčič et al., 24 Oct 2025, Gulati et al., 2024).

1. Design and Scope of LeanDojo Benchmark 4

LeanDojo Benchmark 4, in the context of theorem proving, is defined by the "novel_premises" split: each theorem requires at least one premise not encountered during training, necessitating generalization rather than memorization. This split is extracted from Mathlib4, the main formal mathematics library for Lean4, and filtered for Lean initializability, resulting in a set of 1,659 test theorems spanning domains such as algebra, analysis, topology, and number theory (Wu et al., 2024).

Parallel to this, an alternative instantiation of Benchmark 4 targets Lean4 autoformalization: a suite of 101 tasks where each asks for the formalization, as Lean4 code (theorem or definition plus proof), of a short, informal mathematical statement. These are hand-curated from mathlib4, covering 17 mathematical subject areas and stratified by complexity (Easy, Medium, Hard), thus targeting both statement and proof synthesis fidelity (Gulati et al., 2024).

2. Task Definitions and Evaluation Protocols

The benchmark comprises two primary task formats, both aimed at stress-testing LLMs' generalization in formal mathematics:

  • Theorem Proving (Novel Premises): Given a target theorem, models must generate Lean tactic sequences yielding a valid proof, with candidate proofs verified by the Lean proof assistant under best-first search. Evaluation is based on the success rate $S = \#(\text{theorems proved}) / 1659$, under a search budget of $N \times S \times T = 1 \times 32 \times 100$ (one attempt, 32 candidate tactics per state, up to 100 steps), capped at 10 minutes per theorem; a hedged search-loop sketch follows this list (Wu et al., 2024, Yang et al., 2023).
  • Autoformalization: Each task presents an informal statement to be formalized as Lean4 code. Metrics include correction effort (human edits required, on a 0–4 scale), exact-match accuracy ($\mathrm{Acc}$), token-level $F_1$, normalized edit distance ($\mathrm{NormED}$), and proof-check success rate ($\mathrm{PCheck}$: the fraction of outputs that typecheck in Lean4); these metrics are sketched at the end of this section (Gulati et al., 2024).
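
A hedged Python skeleton of the budgeted best-first search described in the first item above, with suggest_tactics and run_tactic as hypothetical stand-ins for a tactic generator and a Lean interaction layer; this sketches the protocol rather than the benchmark's reference harness.

    # Hedged sketch: budgeted best-first proof search (1 attempt, 32 candidate
    # tactics per state, at most 100 expansions, 10-minute wall-clock cap).
    # `suggest_tactics` and `run_tactic` are hypothetical stand-ins.
    import heapq
    import time
    from typing import Callable, List, Tuple

    def best_first_search(
        init_state: str,
        suggest_tactics: Callable[[str, int], List[Tuple[str, float]]],  # (tactic, log-prob)
        run_tactic: Callable[[str, str], Tuple[str, bool, bool]],        # (next_state, finished, error)
        num_samples: int = 32,
        max_expansions: int = 100,
        timeout_s: float = 600.0,
    ) -> bool:
        """Return True iff a complete proof is found within the budget."""
        start = time.monotonic()
        frontier = [(-0.0, init_state)]  # max-heap on cumulative log-prob (negated for heapq)
        for _ in range(max_expansions):
            if not frontier or time.monotonic() - start > timeout_s:
                return False
            neg_score, state = heapq.heappop(frontier)
            for tactic, logp in suggest_tactics(state, num_samples):
                next_state, finished, error = run_tactic(state, tactic)
                if error:
                    continue
                if finished:
                    return True
                heapq.heappush(frontier, (neg_score - logp, next_state))
        return False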

The novel_premises split fundamentally tests compositionality and robust reasoning, as models must infer how to use or adapt unseen lemmas rather than merely reproduce memorized tactic patterns.
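
To make the evaluation metrics above concrete, the following hedged Python sketch implements the headline quantities; whitespace tokenization and a character-level Levenshtein distance are simplifying assumptions, and the proof-check rate (PCheck) is omitted because it requires invoking the Lean4 elaborator.

    # Hedged sketch of the evaluation metrics listed in this section.
    # Tokenization and normalization conventions are simplified.
    from collections import Counter

    def success_rate(num_proved: int, num_theorems: int = 1659) -> float:
        """S = #(theorems proved) / 1659 for the novel_premises split."""
        return num_proved / num_theorems

    def exact_match(pred: str, gold: str) -> float:
        return float(pred.strip() == gold.strip())

    def token_f1(pred: str, gold: str) -> float:
        p, g = pred.split(), gold.split()
        overlap = sum((Counter(p) & Counter(g)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(p), overlap / len(g)
        return 2 * precision * recall / (precision + recall)

    def normalized_edit_distance(pred: str, gold: str) -> float:
        """Character-level Levenshtein distance divided by the longer length."""
        m, n = len(pred), len(gold)
        if max(m, n) == 0:
            return 0.0
        prev = list(range(n + 1))
        for i in range(1, m + 1):
            cur = [i] + [0] * n
            for j in range(1, n + 1):
                cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                             prev[j - 1] + (pred[i - 1] != gold[j - 1]))
            prev = cur
        return prev[n] / max(m, n)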

3. Data, Benchmarks, and Experimental Infrastructure

The foundation of the benchmark is the extraction of annotated Lean proofs from mathlib (both Lean 3 and Lean 4), with explicit premise–tactic–goal relations, source code snapshots, and fine-grained semantic labels. In mathematical theorem proving, the "novel_premises" split is constructed algorithmically: no test theorem may access a lemma appearing in any training proof, ensuring disjoint premise usage (Yang et al., 2023).
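
A minimal Python sketch of this disjointness criterion, under the simplifying assumption that each theorem is represented by the set of premise names used in its human-written proof; the theorems mapping and the given train_names are hypothetical inputs, not the benchmark's actual extraction pipeline.

    # Hedged sketch: a theorem qualifies for the novel_premises test set only
    # if its proof uses at least one premise absent from every training proof.
    from typing import Dict, Set, Tuple

    def novel_premises_split(
        theorems: Dict[str, Set[str]],   # theorem name -> premises used in its proof
        train_names: Set[str],           # theorems assigned to training
    ) -> Tuple[Set[str], Set[str]]:
        """Return (train, test) theorem names under the novel-premise criterion."""
        train_premises: Set[str] = set()
        for name in train_names:
            train_premises |= theorems[name]
        test = {
            name for name in theorems
            if name not in train_names
            and theorems[name] - train_premises  # at least one unseen premise
        }
        return train_names, test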

In autoformalization, each entry links an English paraphrase (informalization) and a ground-truth Lean4 snippet, with domain and complexity annotations. Examples include statements such as "The Hamming distance of an element to itself is always 0" and their corresponding Lean4 encoding (Gulati et al., 2024).
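
To make the task format concrete, here is a minimal Lean4 sketch of the Hamming-distance example, assuming Mathlib4's hammingDist definition and its hammingDist_self lemma; the names and import path follow Mathlib conventions but may differ across versions, and the snippet is illustrative rather than the benchmark's reference solution.

    -- Illustrative formalization of "The Hamming distance of an element to
    -- itself is always 0". Assumes Mathlib4's `hammingDist` and the lemma
    -- `hammingDist_self`; treat the names and import path as approximate.
    import Mathlib.InformationTheory.Hamming

    example {ι : Type*} [Fintype ι] {β : ι → Type*} [∀ i, DecidableEq (β i)]
        (x : ∀ i, β i) : hammingDist x x = 0 :=
      hammingDist_self x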

Key data statistics are shown below:

Benchmark View     Theorem Proving            Autoformalization
# Examples         1,659 (novel_premises)     101
Source             Mathlib4                   Mathlib4 (informalized)
Domains            Algebra, Topology, ...     17 subject areas
Output Format      Tactic sequence            Lean4 code (declaration + proof)

4. Methodologies and Modeling Advances

State-of-the-art systems on LeanDojo Benchmark 4 leverage various neural and retrieval-augmented architectures:

  • Alchemy Dataset Augmentation: The Alchemy framework synthesizes ≈6.3M new theorems via symbolic mutation (using the rewrite tactic "rw" and the implication tactic "apply"), expanding the original Mathlib corpus by a factor of roughly 25–44. This synthetic corpus enables continual pretraining and supervised proof-step fine-tuning of transformer models (e.g., Llama 3 8B, deepseek-coder-7B), and drives an absolute performance improvement on the benchmark of +4.7 pp for Llama 3 8B (Wu et al., 2024).
  • Retrieval-Augmented Models: ReProver combines dual-encoder text retrieval with fine-grained labeling of proof-state/premise pairs, enabling targeted premise selection. The graph-augmented model in (Petrovčič et al., 24 Oct 2025) further enhances premise selection via a Relational Graph Convolutional Network (RGCN) that propagates information over heterogeneous dependency graphs $G=(V,E,R,X)$ reflecting both symbol relations and proof dependencies in Mathlib; a hedged sketch follows this list.
  • Autoformalization: Zero-shot prompting of GPT-3.5, GPT-4, and Gemini Pro tested their ability to translate informal mathematical English into valid Lean4 code snippets, without fine-tuning or retrieval support (Gulati et al., 2024).
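
The following Python sketch (not the cited paper's implementation) shows how an RGCN-based premise encoder over $G=(V,E,R,X)$ might be wired up, assuming PyTorch Geometric; layer sizes, depth, and the cosine-similarity scoring against a separately computed goal embedding are illustrative choices.

    # Hedged sketch: relational GCN over a heterogeneous dependency graph.
    # Nodes are Mathlib declarations with features X; edge_type encodes the
    # relation in R (e.g. symbol co-occurrence vs. proof dependency).
    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import RGCNConv

    class PremiseRGCN(torch.nn.Module):
        def __init__(self, in_dim: int, hid_dim: int, out_dim: int, num_relations: int):
            super().__init__()
            self.conv1 = RGCNConv(in_dim, hid_dim, num_relations)
            self.conv2 = RGCNConv(hid_dim, out_dim, num_relations)

        def forward(self, x, edge_index, edge_type):
            h = F.relu(self.conv1(x, edge_index, edge_type))
            return self.conv2(h, edge_index, edge_type)

    def rank_premises(premise_emb: torch.Tensor, goal_emb: torch.Tensor) -> torch.Tensor:
        # Rank premises by cosine similarity to a goal embedding produced by a
        # separate text encoder (e.g. ReProver's retriever).
        scores = F.cosine_similarity(premise_emb, goal_emb.unsqueeze(0), dim=-1)
        return scores.argsort(descending=True)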

5. Quantitative Results and Ablation Analyses

On the theorem-proving variant, Alchemy-style symbolic mutations increase the success rate (fraction of theorems proved) from 38.52% to 43.22% for Llama 3 8B (base vs. +rw+apply), far exceeding the built-in tidy heuristic (5.3%), GPT-4 (7.4%), and the retrieval-augmented ReProver baseline (26.3%) on this split (Wu et al., 2024, Yang et al., 2023). The gains from rewriting-based and implication-based mutations are nearly additive when combined.

For premise selection, the GNN-augmented retriever improves Recall@1 from 13.42% (ReProver baseline) to 17.98%, a 33.98% relative gain, with Recall@10 and MRR showing over 25% relative improvement. This underscores the benefit of modeling the library's structural dependencies (Petrovčič et al., 24 Oct 2025).
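
For concreteness, the quoted relative gain follows directly from the two absolute Recall@1 figures:

    # Relative improvement implied by the reported Recall@1 numbers.
    baseline, improved = 13.42, 17.98
    relative_gain = (improved - baseline) / baseline
    print(f"{relative_gain:.2%}")  # 33.98%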

In autoformalization, current LLMs deliver only moderate performance: correction effort of ≈2.24 on the 0–4 scale, exact-match accuracy under 10%, and proof-check success under 20%, with the best per-topic performance on elementary logic and information theory. The hardest domains are category theory and model theory (Gulati et al., 2024).

6. Challenges, Limitations, and Future Directions

LeanDojo Benchmark 4 exposes several persistent obstacles:

  • Generalization: Models exhibit noticeable performance degradation on novel_premises compared to random splits, highlighting the data scarcity and long-tail premise induction challenges (Yang et al., 2023, Wu et al., 2024).
  • Structural Reasoning: While RGCN-based approaches provide significant gains, deeper GNN variants, integration of file/import DAGs, and joint text–structure co-training remain largely unexplored (Petrovčič et al., 24 Oct 2025).
  • Autoformalization Limitations: LLMs struggle with complex domains, fragile Lean syntax, and external lemma invocation without few-shot or retrieval augmentation (Gulati et al., 2024).

Proposed future work includes:

  • Expanding mutation and augmentation pipelines (e.g., multi-step or semantic-preserving transformations),
  • Developing advanced Graph Neural Networks (GAT, Relational Graph Transformers) for richer dependency modeling,
  • Incorporating scoping information and online file-based filtering for premise access,
  • Extending benchmarks to full proof synthesis and out-of-distribution evaluation, and
  • Exploring fine-tuning, LoRA adaptation, and retrieval-based prompting for autoformalization tasks.

7. Significance and Research Impact

LeanDojo Benchmark 4 has established itself as a cornerstone for evaluating out-of-distribution formal reasoning capacities in LLMs, guiding both algorithmic and dataset innovations in automated theorem proving. By enforcing premise disjointness and supporting comprehensive annotation, it enables reproducible, robust benchmarking across textual, structural, and compositional axes. The rigorous protocol and transparent reporting (success rate, proof-check, correction effort, retrieval metrics) have accelerated the development of stronger, more generalizable neural provers, and provided a foundational resource for the integration of language modeling and symbolic mathematical reasoning (Wu et al., 2024, Yang et al., 2023, Petrovčič et al., 24 Oct 2025, Gulati et al., 2024).
