Lean Copilot: LLM-Enhanced Theorem Proving

Updated 15 June 2026

Lean Copilot is an open-source framework that natively integrates large language models into Lean 4, enabling context-aware, automated proof generation.
It employs a native C++ FFI integration and beam search decoding to efficiently process Lean goals and deliver optimized tactic suggestions in real time.
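In an evaluation on 50 exercises from "Mathematics in Lean", its search_proof tactic automates 81.2% of proof steps, compared with 35.2% for the rule-based aesop baseline, and the companion LEAN-GitHub corpus broadens the formal training data available for autonomous theorem proving.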

Lean Copilot is a native, open-source framework that integrates LLMs into the Lean 4 theorem prover, providing context-aware proof automation and copilot-style assistance directly in the interactive development environment. It addresses two obstacles, data scarcity and real-time interaction, and draws on recent work in compiling formal proof corpora to support both researcher-driven and autonomous theorem proving in Lean (Song et al., 2024; Wu et al., 2024).

1. Native Architecture and Inference Integration

Lean Copilot embeds transformer-based LLMs natively within the Lean runtime via the Foreign Function Interface (FFI); the default is the ReProver model (ByT5, 229M parameters, fine-tuned on LeanDojo). Overhead is kept low by a direct pipeline from Lean tactics (e.g., suggest_tactics) through the C++ FFI to the model inference engine, which runs on CTranslate2. In local mode this avoids Python subprocesses and RPC overhead:

Lean → C++ → LLM: Goal state is passed as a UTF-8 string from Lean to the LLM via FFI.
Beam Search Decoding: In C++, beam search (parameterizable for $k$, temperature, etc.) is run to produce the top-$k$ output strings.
Lean Post-processing: Returned tactic texts or embeddings are parsed, type-checked, and displayed in Lean's infoview.
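
The decoding step in the pipeline above can be illustrated with a minimal beam search sketch. This is not Lean Copilot's C++ implementation: the toy_model function, its token probabilities, and the token-level tactic vocabulary are all hypothetical stand-ins for a real LLM, chosen only to show how the top-k candidates are kept at each step.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"  # hypothetical end-of-sequence token

# Hypothetical model interface: prefix tokens -> next-token probabilities.
NextTokenFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def beam_search(next_token: NextTokenFn, beam_width: int,
                max_len: int) -> List[Tuple[float, Tuple[str, ...]]]:
    """Return up to beam_width finished sequences, best log-probability first."""
    beams = [(0.0, ())]  # (cumulative log-prob, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, toks in beams:
            for tok, p in next_token(toks).items():
                if p <= 0.0:
                    continue
                cand = (logp + math.log(p), toks + (tok,))
                # Sequences ending in EOS leave the beam and are collected.
                (finished if tok == EOS else candidates).append(cand)
        if not candidates:
            break
        # Keep only the beam_width most probable unfinished prefixes.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:beam_width]


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Toy distribution over tactic tokens (illustrative only)."""
    if prefix == ():
        return {"simp": 0.6, "rw": 0.4}
    if prefix == ("simp",):
        return {EOS: 0.9, "[h]": 0.1}
    if prefix == ("rw",):
        return {"[h]": 0.7, EOS: 0.3}
    return {EOS: 1.0}


results = beam_search(toy_model, beam_width=2, max_len=4)
tactics = [" ".join(t for t in toks if t != EOS) for _, toks in results]
print(tactics)
```

In the real system the returned strings would then be handed back to Lean for parsing and type-checking, as described in the last step of the pipeline.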

Lean Copilot exposes two core low-level APIs, a text generator and a text encoder (TextGenerator and TextEncoder). These interfaces enable both text generation and goal state encoding for downstream tasks (Song et al., 2024).
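
As a rough, language-agnostic picture of the two interface shapes (one mapping a goal string to scored tactic strings, the other mapping it to a vector), the sketch below uses Python protocols. The interface names TextGenerator and TextEncoder are the ones used elsewhere in this article; the method names, signatures, and both dummy implementations are assumptions made for illustration and are not Lean Copilot's actual Lean API.

```python
from typing import List, Protocol, Sequence, Tuple


class TextGenerator(Protocol):
    """Assumed shape: goal text in, (tactic text, score) pairs out."""
    def generate(self, goal: str) -> List[Tuple[str, float]]: ...


class TextEncoder(Protocol):
    """Assumed shape: goal text in, fixed-length embedding out."""
    def encode(self, goal: str) -> Sequence[float]: ...


class EchoGenerator:
    """Dummy generator that always proposes the same two tactics."""
    def generate(self, goal: str) -> List[Tuple[str, float]]:
        return [("simp", 0.7), ("rfl", 0.3)]


class LengthEncoder:
    """Dummy encoder: (character count, number of '=' signs)."""
    def encode(self, goal: str) -> Sequence[float]:
        return [float(len(goal)), float(goal.count("="))]


def describe(gen: TextGenerator, enc: TextEncoder, goal: str):
    # Any object with the right methods satisfies the protocols,
    # which is what makes swapping models behind the interface cheap.
    return gen.generate(goal), enc.encode(goal)


suggestions, embedding = describe(EchoGenerator(), LengthEncoder(),
                                  "a + b = b + a")
print(suggestions, embedding)
```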

2. Workflow Integration and Proof Automation Tactics

Lean Copilot introduces three out-of-the-box tactics that integrate LLM-powered proof automation into standard Lean workflows:

suggest_tactics: Queries the LLM for tactic suggestions on the current goal, type-checks results, and highlights suggestions in the infoview (green for goal closure, blue for goal progress).
search_proof: A drop-in replacement for the rule-based aesop search; during proof search, at each node, aesop may call LLM-generated, goal-dependent tactics. If a complete proof is found, all goals are closed.
select_premises: Encodes the current goal into a vector and uses a precomputed "premise embedding matrix" (via BLAS and Libnpy) to retrieve and rank top-k relevant Mathlib lemmas.
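
The retrieval step behind select_premises can be sketched as a matrix-vector product followed by a top-k sort. The example below is a pure-Python illustration, not the BLAS-backed implementation: the two-dimensional embeddings are made up, the three premise names are used only as labels, and scoring by plain dot product is an assumption.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def top_k_premises(goal_vec: Sequence[float],
                   premise_matrix: List[Sequence[float]],
                   premise_names: List[str],
                   k: int) -> List[Tuple[str, float]]:
    """Score every precomputed premise embedding against the goal, keep the best k."""
    scores = [dot(row, goal_vec) for row in premise_matrix]  # matrix-vector product
    ranked = sorted(zip(premise_names, scores), key=lambda p: p[1], reverse=True)
    return ranked[:k]


# Hypothetical precomputed premise embedding matrix (one row per lemma).
names = ["Nat.add_comm", "Nat.mul_comm", "Nat.add_assoc"]
matrix = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]
goal = [1.0, 0.0]  # hypothetical embedding of the current goal

top = top_k_premises(goal, matrix, names, k=2)
print([name for name, _ in top])
```

Because the premise matrix is computed once ahead of time, only the goal needs to be encoded at query time, which is what keeps retrieval fast enough for interactive use.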

Usage is idiomatic to Lean’s interactive scripting: each tactic is invoked inside a tactic proof like any other tactic (Song et al., 2024).

3. Model Management, Deployment, and Extensibility

Lean Copilot ships with ReProver by default, but supports extensive model swapping:

Local inference via CTranslate2 for CPU/GPU, requiring only libctranslate2.
Server mode via a Python API server (e.g., HuggingFace Transformers). Users register an external generator by specifying host, port, and model.
Swapping models (e.g., to GPT4All, LLaMA2 via llama.cpp, or StarCoder) is accomplished by repointing the model URL in the configuration without altering client code.

This modularity, combined with stable low-level interfaces (TextGenerator, TextEncoder), allows for rapid research iteration and custom LLM integration (Song et al., 2024).

4. Empirical Performance and Benchmarks

Experimental assessment of Lean Copilot on 50 exercises from "Mathematics in Lean" (average proof length 5.52 tactics) yields the following key metrics:

Method	Avg. # Human Tactics ↓	% Autonomous ↑	Avg. % Steps Automated ↑
aesop	3.62	12%	35.2%
suggest_tactics	2.72	34%	48.6%
search_proof	1.02	64%	81.2%

Automation rate is defined as $\mathrm{AutoRate} = \frac{T - h}{T}\times 100\%$, where $T$ is the ground-truth tactic count and $h$ is the number of tactics the user had to enter before the proof succeeded. The hybrid search_proof tactic automates more than twice as many steps as the rule-based aesop baseline (81.2% vs. 35.2%) and cuts the average number of human-entered tactics from 3.62 to 1.02 (Song et al., 2024).
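
A small worked example of the automation-rate formula may help. The per-proof tactic counts below are invented for illustration and are not taken from the benchmark; averaging the per-proof rates is also an assumption about how a table-level figure could be aggregated.

```python
def auto_rate(total_tactics: int, human_tactics: int) -> float:
    """AutoRate = (T - h) / T * 100, with T the ground-truth tactic count
    and h the number of tactics the user had to type."""
    if total_tactics <= 0:
        raise ValueError("total_tactics must be positive")
    if human_tactics < 0:
        raise ValueError("human_tactics must be non-negative")
    return 100.0 * (total_tactics - human_tactics) / total_tactics


# Hypothetical proofs: (ground-truth length T, human-entered tactics h).
proofs = [(5, 1), (4, 0), (6, 3)]
rates = [auto_rate(t, h) for t, h in proofs]
mean_rate = sum(rates) / len(rates)
print(rates, round(mean_rate, 1))
```

A proof that needs no human input (h = 0) scores 100%, and one the user writes entirely by hand (h = T) scores 0%.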

5. Expansion of Training Data: Lean-GitHub Corpus

A critical advance toward scaling Lean Copilot’s capabilities is the introduction of the LEAN-GitHub dataset. This corpus systematically extracts 28,597 theorems and 218,866 tactics from 147 Lean 4 repositories (GitHub user projects, Mathlib 4, synthetic workbooks), bringing the total to 145,597 theorems and 698,866 tactics when combined with Mathlib and synthetic sources. Fine-tuning the 7B-parameter InternLM-math-plus model on this heterogeneous data yields the InternLM2-StepProver model (Wu et al., 2024).

Source	# Theorems	# Tactics	# Tokens
Mathlib 4 (LeanDojo)	60,000	480,000	0.131B
GitHub (LEAN-GitHub)	28,597	218,866	0.138B
Synthetic (Workbook)	57,000	—	0.029B
Total	145,597	698,866	0.298B
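
The totals in the table can be checked with a few lines of arithmetic. The figures below are copied from the table; the only assumption is that the tactic total excludes the synthetic source, for which no tactic count is given.

```python
# (theorems, tactics or None if not reported, tokens in billions)
sources = {
    "Mathlib 4 (LeanDojo)": (60_000, 480_000, 0.131),
    "GitHub (LEAN-GitHub)": (28_597, 218_866, 0.138),
    "Synthetic (Workbook)": (57_000, None, 0.029),
}

total_theorems = sum(thm for thm, _, _ in sources.values())
total_tactics = sum(tac for _, tac, _ in sources.values() if tac is not None)
total_tokens = round(sum(tok for _, _, tok in sources.values()), 3)

print(total_theorems, total_tactics, total_tokens)
```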

This multiplicity of sources yields accuracy gains across benchmarks. On miniF2F, the fine-tuned model (InternLM2-StepProver) achieves Pass@1 = 48.8% and Pass@64 = 54.5%; the 64-pass result surpasses the prior state of the art of 52.0% reported for DeepSeek-Prover (7B). Results on ProofNet and PutnamBench also match or exceed prior work (Wu et al., 2024).

6. Search, Caching, and IDE Integration Best Practices

At inference, Lean Copilot and its successors employ best-first search over tactic sequences: from each state $S_i$, $S=32$ candidate tactics are sampled, and up to $K=100$ states are expanded. Canonical state fingerprinting (renaming hypotheses by storage index) is used for deduplication, eliminating $>50\%$ of redundant states.
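
The search loop and the deduplication step can be sketched together. This is a minimal illustration under several assumptions: states are ranked by cumulative log-probability, the fingerprint renames hypotheses to positional names h0, h1, ..., and the integer "proof states" with sub1 and halve "tactics" are a toy stand-in for Lean goals. None of the function names come from Lean Copilot.

```python
import heapq
import re
from typing import Callable, Hashable, List, Optional, Tuple


def fingerprint(hyps: List[Tuple[str, str]], goal: str) -> str:
    """Rename hypotheses by position so states that differ only in
    hypothesis names map to the same key."""
    mapping = {name: f"h{i}" for i, (name, _) in enumerate(hyps)}

    def rename(text: str) -> str:
        if not mapping:
            return text
        pattern = r"\b(" + "|".join(re.escape(n) for n in mapping) + r")\b"
        # Single pass, so a fresh name is never renamed a second time.
        return re.sub(pattern, lambda m: mapping[m.group(1)], text)

    parts = [f"h{i} : {rename(ty)}" for i, (_, ty) in enumerate(hyps)]
    return " | ".join(parts) + " ⊢ " + rename(goal)


def best_first_search(root, expand: Callable, is_solved: Callable,
                      key: Callable[[object], Hashable],
                      samples: int = 32,
                      max_expansions: int = 100) -> Optional[List[str]]:
    """expand(state) returns (tactic, log_prob, next_state) triples."""
    counter = 0  # tie-breaker so the heap never compares states directly
    frontier = [(0.0, counter, root, [])]  # (negated log-prob, tie, state, path)
    seen = {key(root)}
    expansions = 0
    while frontier and expansions < max_expansions:
        cost, _, state, path = heapq.heappop(frontier)
        if is_solved(state):
            return path
        expansions += 1
        for tactic, logp, nxt in expand(state)[:samples]:
            fp = key(nxt)
            if fp in seen:  # duplicate state: skip instead of re-expanding
                continue
            seen.add(fp)
            counter += 1
            heapq.heappush(frontier, (cost - logp, counter, nxt, path + [tactic]))
    return None


def toy_expand(n: int):
    """Toy tactics on integers: halve even numbers, or subtract one."""
    out = []
    if n > 0 and n % 2 == 0:
        out.append(("halve", -0.5, n // 2))
    if n > 0:
        out.append(("sub1", -1.0, n - 1))
    return out


proof = best_first_search(6, toy_expand, lambda n: n == 0, key=lambda n: n)
print(proof)
```

In a real prover the key function would be the fingerprint of the pretty-printed goal state rather than the identity, and the expansion budget and sample count would play the roles of $K$ and $S$ above.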

For real-time IDE integration, guidance includes:

Maintaining a flash cache of the most recent 1,000 (goal, tactic) pairs for rapid completion.
Precomputing and caching top-5 tactic suggestions per goal fingerprint.
Utilizing highly quantized LLMs (e.g., 4-bit) for ultra-low-latency completions ($\leq 200$ ms).
Continuous user feedback loop: logging accepted/refuted tactics to drive online tuning.
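
The first two caching recommendations can be combined in one small structure. The sketch below is an assumption-laden illustration: it keys entries by goal fingerprint, stores at most five tactics per goal, and evicts the least recently used goal once capacity is reached. Counting capacity in goals rather than in (goal, tactic) pairs, and the class name SuggestionCache, are choices made here for simplicity.

```python
from collections import OrderedDict
from typing import Iterable, List, Optional


class SuggestionCache:
    """LRU cache from goal fingerprint to its top tactic suggestions."""

    def __init__(self, capacity: int = 1000, top_n: int = 5) -> None:
        self.capacity = capacity
        self.top_n = top_n
        self._data: "OrderedDict[str, List[str]]" = OrderedDict()

    def get(self, goal_fp: str) -> Optional[List[str]]:
        if goal_fp not in self._data:
            return None
        self._data.move_to_end(goal_fp)  # mark as most recently used
        return self._data[goal_fp]

    def put(self, goal_fp: str, tactics: Iterable[str]) -> None:
        self._data[goal_fp] = list(tactics)[: self.top_n]
        self._data.move_to_end(goal_fp)
        while len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


cache = SuggestionCache(capacity=2, top_n=2)
cache.put("g1", ["simp", "rfl", "omega"])
cache.put("g2", ["ring"])
hit = cache.get("g1")            # touching g1 makes g2 the eviction candidate
cache.put("g3", ["linarith"])
miss = cache.get("g2")
print(hit, miss)
```

A cache hit skips model inference entirely, which is the cheapest way to stay under a latency target for goals the user revisits while editing.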

Recommended user-experience metrics are suggestion latency (target $<300$ ms), suggestion acceptance rate, and time-to-proof reduction (average keystrokes saved) (Wu et al., 2024).

7. Limitations, Failure Modes, and Future Mitigations

Observed limitations include:

Deep, multi-stage Olympiad proofs may require more expansions than the search budget allows.
Non-linear, creative proof strategies are rare in training data.
Model may hallucinate tactics for unreachable states if not constrained by type-checking.

Recommended mitigation strategies:

Adaptive budget allocation, increasing the search budget for high-value subgoals.
Online user-refinement for tactic ranking.
Incorporation of informal proof sketches to improve high-level planning (Wu et al., 2024).

A plausible implication is that further improvements will depend significantly on increasing the diversity and depth of formal corpora and on tightening the interaction between model-generated suggestions and Lean's kernel-level type-checking and verification routines.

Lean Copilot demonstrates that a combination of scalable transformer-based models, diversified formal training data, native system integration, and user-centric tooling can substantially decrease the manual effort in Lean proofs without loss of correctness. Empirical results and the rapidly growing Lean-GitHub corpus suggest continued advances in both interactive and autonomous formal mathematics from such copilots (Song et al., 2024; Wu et al., 2024).


References

Song et al. (2024). Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean.

Wu et al. (2024). LEAN-GitHub: Compiling GitHub LEAN repositories for a versatile LEAN prover.