LEAP: Agentic LLMs in Lean Theorem Proving
- LEAP is a formal reasoning system that embeds LLMs in Lean to achieve advanced agentic theorem proving.
- It combines orchestration layers, semantic retrieval, and dynamic planning to decompose and verify proofs effectively.
- The system demonstrates state-of-the-art performance with cost savings and adaptability in complex mathematical tasks.
LEAP (LLM-in-Lean Environment Agentic Prover) denotes a class of formal reasoning systems and architectures that embed LLMs inside the Lean proof assistant, employing an agentic paradigm to achieve state-of-the-art results in automated formal mathematics, mathematical research formalization, and contextually complex theorem proving. LEAP systems augment general-purpose LLMs with orchestration layers, verification feedback, dynamic planning, retrieval from large proof corpora, and tool-augmented workflows to bridge the gap between informal reasoning and mechanized, kernel-checked proofs. They mark a shift from monolithic, pipeline-style provers toward plug-and-play, blueprint-driven, highly modular agents that can address a broad range of formal tasks without fine-tuning or specialized model training (Liu et al., 20 Jan 2026, Kung et al., 2 Jun 2026, Varambally et al., 26 Sep 2025).
1. Architecture and Agentic Loop
LEAP systems are fundamentally multi-component loops, with the LLM as "agent," tightly coupled to the Lean kernel and a set of orchestration and retrieval tools. Core elements include:
- Base LLM (“brain”): A general-purpose coding or foundation model (e.g., Claude Code, Gemini-3.1-Pro) acts as the planner and primary reasoner.
- Controller/Dispatcher (e.g., Numina-Lean-MCP): Routes proof steps, triggers tool calls, manages interaction with proof state and tool feedback.
- Lean + OS environment: Acts as the backend for proof verification, compilation, and tactic execution.
- Retrieval Modules: Provide access to theorem indices, Mathlib semantic search (e.g., lean_loogle), and premise selection.
- Auxiliary Tools: Informal prover, discussion partner, and custom logistic-regression (LR) or reinforcement-learning (RL) agents.
The canonical high-level dataflow is:
Model (LLM) → Controller (MCP) → Lean (Kernel, Tools) → Environment/Feedback → (back to Model)
At each iteration, the model plans the next proof step, the controller invokes relevant tools (e.g., checking diagnostics, searching for lemmas), Lean executes tactics and provides feedback, and the agent re-plans accordingly (Liu et al., 20 Jan 2026).
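The iteration described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of any cited system: `run_agent_loop`, `AgentState`, `Feedback`, and the toy planner and checker are hypothetical names, and the Lean backend is replaced by a stub that accepts one hard-coded tactic.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Feedback:
    """What the (stubbed) verification backend reports after a step."""
    ok: bool       # True when the step closes the goal
    message: str   # diagnostic text fed back to the planner


@dataclass
class AgentState:
    goal: str
    history: List[str] = field(default_factory=list)  # steps tried so far
    last_feedback: Optional[Feedback] = None


def run_agent_loop(
    goal: str,
    plan_step: Callable[[AgentState], str],
    check_step: Callable[[str, str], Feedback],
    max_iters: int = 8,
) -> Optional[List[str]]:
    """Plan, dispatch, verify, re-plan; stop on success or when the budget runs out."""
    state = AgentState(goal=goal)
    for _ in range(max_iters):
        step = plan_step(state)                  # model proposes the next step
        feedback = check_step(state.goal, step)  # controller forwards it to the checker
        state.history.append(step)
        state.last_feedback = feedback           # visible to the planner next round
        if feedback.ok:
            return state.history
    return None


# Hypothetical stand-ins for the LLM and for Lean, used only to run the sketch.
def toy_planner(state: AgentState) -> str:
    candidates = ["simp", "omega", "rfl"]
    return candidates[len(state.history) % len(candidates)]


def toy_checker(goal: str, step: str) -> Feedback:
    if step == "omega":
        return Feedback(True, "no goals")
    return Feedback(False, f"tactic '{step}' failed on {goal}")


trace = run_agent_loop("a + 0 = a", toy_planner, toy_checker)
print(trace)
```

In a real deployment the planner would be an LLM call and the checker a round trip through the controller to Lean; the loop structure is the only part this sketch is meant to convey.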
The loop is extended in advanced LEAP systems by outer architectural constructs such as AND-OR DAG memory (for managing decomposition and blueprints), recursive decomposition (for complex goal splitting), and explicit reviewer/self-reflection modules (Kung et al., 2 Jun 2026, Varambally et al., 26 Sep 2025).
2. Lean Integration and Tooling
LEAP architectures use Lean’s LSP (Language Server Protocol) and custom MCP (Model Context Protocol) to expose internal proof state and control to the LLM:
- Commands: `lean_goal` (query current proof goal), `lean_diagnostic` (type errors/completeness), `lean_run_code` (execute tactics), `lean_file_outline` (file structure), `lean_multi_attempt`, `lean_local_search`, `lean_loogle` (semantic NL-theorem search).
- Proof-state feedback: Returned after each tactic or tool call, enabling the LLM to adjust actions in a closed-loop system.
- API Design: All LLM actions ultimately correspond to edit or query operations on Lean files, mediated by controller/dispatcher logic (Liu et al., 20 Jan 2026, Tredici et al., 14 Oct 2025).
Systematic use of these tool APIs enables agents to perform:
- Adaptive tactic selection (with in-context error correction)
- Semantic and local retrieval for lemma search and premise selection
- Multi-step interactive proof construction and error-driven refinement
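To make the controller/dispatcher role concrete, the following sketch routes named tool calls to handlers and wraps every result in a uniform reply. The tool names `lean_goal` and `lean_diagnostic` come from the list above, but their argument lists, the `ToolDispatcher` class, and the fake handlers are assumptions made for illustration; they are not the actual MCP or LSP interfaces.

```python
from typing import Any, Callable, Dict, List


class ToolDispatcher:
    """Maps tool names to handlers and normalizes replies for the model."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, handler: Callable[..., Any]) -> None:
        self._tools[name] = handler

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        if name not in self._tools:
            return {"ok": False, "error": f"unknown tool: {name}"}
        try:
            return {"ok": True, "result": self._tools[name](**kwargs)}
        except Exception as exc:  # report failures as feedback instead of crashing
            return {"ok": False, "error": str(exc)}


# Hypothetical handlers; real ones would query a Lean language server.
def fake_lean_goal(file: str, line: int) -> str:
    return f"goal at {file}:{line}"


def fake_lean_diagnostic(file: str) -> List[str]:
    return []  # an empty list stands for "no errors reported"


dispatcher = ToolDispatcher()
dispatcher.register("lean_goal", fake_lean_goal)
dispatcher.register("lean_diagnostic", fake_lean_diagnostic)

reply = dispatcher.call("lean_goal", file="Main.lean", line=12)
missing = dispatcher.call("lean_unknown")
bad_args = dispatcher.call("lean_goal", file="Main.lean")  # missing `line`
print(reply, missing, bad_args)
```

Returning errors as data rather than raising keeps the loop closed: a malformed tool call becomes one more piece of feedback the model can react to.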
3. Decomposition, Proof Generation, and Self-Refinement
A defining feature of LEAP is hierarchical task decomposition and iterative refinement:
- Blueprint-based Decomposition: Agents first propose an informal "blueprint" DAG (graph) of intermediate lemmas and strategies. The decomposer agent formalizes these into subgoals, introducing `sorry` placeholders only for genuinely new sublemma nodes.
- Direct and Indirect Proof Generation: For each open goal, the agent attempts direct proof; on failure, it recursively decomposes, translates blueprints into Lean code, and refines subgoals (Kung et al., 2 Jun 2026, Varambally et al., 26 Sep 2025).
- Self-Refinement Loop: Lean kernel errors are fed back into the model as prompts, eliciting minimal and targeted code edits, tactic adjustments, import corrections, or subplan revisions. Dead ends or cycles are detected and pruned using acyclic DAG memory and LLM-based reviewers.
- Verifier-Driven Orchestration: Every proof script is submitted to the Lean kernel for type-correctness. Only formally validated steps are allowed, ensuring mechanical soundness at all times (Kung et al., 2 Jun 2026, Varambally et al., 26 Sep 2025, Liu et al., 20 Jan 2026).
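As a small illustration of the `sorry`-placeholder step in blueprint-based decomposition, the Lean 4 sketch below states a sublemma, leaves it open, and closes the parent goal on top of it. The theorem names are hypothetical and the example is deliberately elementary; it is not drawn from any of the cited systems.

```lean
-- Hypothetical sublemma proposed by the blueprint. It is left open with
-- `sorry` until a prover agent closes it.
theorem sq_le_sq_of_le (a b : Nat) (h : a ≤ b) : a * a ≤ b * b := by
  sorry

-- The parent goal is discharged assuming the sublemma, so Lean can already
-- check the overall structure (with a warning about the remaining `sorry`).
theorem sq_le_sq_succ (n : Nat) : n * n ≤ (n + 1) * (n + 1) := by
  exact sq_le_sq_of_le n (n + 1) (Nat.le_add_right n 1)
```

A later direct-proof attempt could plausibly replace the `sorry` with `exact Nat.mul_le_mul h h`; until every placeholder is gone, the proof does not count as verified.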
Examples of common proof workflows:
- Informal sketch generation (for metastructural planning)
- Tactic selection exploiting Mathlib and retrieved theorems
- Recursive goal splitting, with deep decomposition for multi-step IMO or Putnam-scale theorems
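The "attempt direct proof, otherwise decompose" pattern can be sketched as a recursive search over an AND-OR structure. Everything here is an assumption made for illustration: `prove`, `try_direct`, and `decompose` are hypothetical names, goals are plain strings, and the toy tables stand in for LLM proposals and Lean verification.

```python
from typing import Callable, Dict, FrozenSet, List, Optional


def prove(
    goal: str,
    try_direct: Callable[[str], Optional[str]],
    decompose: Callable[[str], List[List[str]]],
    max_depth: int = 3,
    ancestors: FrozenSet[str] = frozenset(),
) -> Optional[dict]:
    """Return a proof tree for `goal`, or None if no branch closes."""
    if goal in ancestors or max_depth < 0:  # prune cycles and over-deep branches
        return None
    script = try_direct(goal)
    if script is not None:
        return {"goal": goal, "proof": script, "children": []}
    for alternative in decompose(goal):     # OR node: one alternative suffices
        children = []
        for sub in alternative:             # AND node: every subgoal must close
            child = prove(sub, try_direct, decompose,
                          max_depth - 1, ancestors | {goal})
            if child is None:
                break
            children.append(child)
        else:
            if children:
                return {"goal": goal, "proof": "assemble", "children": children}
    return None


# Toy stand-ins: which goals close directly, and how goals may be split.
DIRECT: Dict[str, str] = {"B": "by simp", "C": "by omega"}
SPLITS: Dict[str, List[List[str]]] = {"A": [["X"], ["B", "C"]], "D": [["D"]]}

tree = prove("A", DIRECT.get, lambda g: SPLITS.get(g, []))
cyclic = prove("D", DIRECT.get, lambda g: SPLITS.get(g, []))
print(tree, cyclic)
```

Goal `A` fails through its first alternative (`X` is neither provable nor splittable) and succeeds through the second, while `D` is rejected because its only decomposition refers back to itself.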
4. Retrieval Integration and Cost-Aware Planning
LEAP systems universally augment the LLM with semantic retrieval and cost-aware resource allocation:
- Semantic and Local Retrieval: Lean 4 tools (e.g., FAISS indices, Mathlib semantic search via lean_loogle) return candidate lemmas or theorems by cosine similarity or pattern match, which are filtered/adapted by the agent at each step (Varambally et al., 26 Sep 2025, Liu et al., 20 Jan 2026).
- Action Routing Agents: Recent enhancements optimize the compute-quality tradeoff by introducing a bifurcated architecture—data plane (lemma-style proof generation) and control plane (logistic-regression router estimating marginal success/cost from attempt history), yielding up to 25.8% cost savings at fixed accuracy (Rögnvaldsson et al., 3 Jun 2026).
- Parameterized Routing: Policies are learned from trajectories with features (proof similarity, error diversity, attempt count), dynamically terminating dead-ends and prioritizing promising branches, with possible future extensions to richer meta-reasoning and multi-model escalation (Rögnvaldsson et al., 3 Jun 2026).
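A logistic-regression router of the kind described above can be sketched as follows. The feature names follow the list (proof similarity, error diversity, attempt count), but the weights, the bias, and the stopping rule are invented for illustration and are not the learned policy from the cited work.

```python
import math
from dataclasses import dataclass


@dataclass
class AttemptFeatures:
    proof_similarity: float  # 0..1, how alike recent attempts are
    error_diversity: float   # 0..1, how varied the recent errors are
    attempt_count: int       # attempts already spent on this goal


# Hypothetical coefficients; a real router would fit these on trajectories.
BIAS = 0.5
W_SIMILARITY = -2.0  # near-identical attempts suggest the agent is stuck
W_DIVERSITY = 1.0    # varied errors suggest exploration is still productive
W_COUNT = -0.3       # each extra attempt lowers the estimated payoff


def success_probability(f: AttemptFeatures) -> float:
    """Logistic model of the chance that one more attempt succeeds."""
    z = (BIAS
         + W_SIMILARITY * f.proof_similarity
         + W_DIVERSITY * f.error_diversity
         + W_COUNT * f.attempt_count)
    return 1.0 / (1.0 + math.exp(-z))


def should_continue(f: AttemptFeatures, cost_per_attempt: float,
                    value_of_proof: float) -> bool:
    """Continue only while the expected gain exceeds the cost of an attempt."""
    return success_probability(f) * value_of_proof > cost_per_attempt


fresh = AttemptFeatures(proof_similarity=0.1, error_diversity=0.8, attempt_count=1)
stuck = AttemptFeatures(proof_similarity=0.95, error_diversity=0.1, attempt_count=10)

print(should_continue(fresh, cost_per_attempt=1.0, value_of_proof=10.0))
print(should_continue(stuck, cost_per_attempt=1.0, value_of_proof=10.0))
```

The separation mirrors the data-plane/control-plane split: proof generation never sees these numbers, and the router never sees proof text, only summary features of the attempt history.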
5. Evaluation, Benchmarks, and Empirical Performance
LEAP systems report state-of-the-art results on diverse, challenging formal benchmarks without fine-tuning:
| Benchmark | LEAP (Recent SOTA) | Comparative Systems | Metric | Source |
|---|---|---|---|---|
| Putnam 2025 | 12/12 solved | Aristotle: 10/12 | Pass@1 | (Liu et al., 20 Jan 2026) |
| Lean-IMO-Bench | 70% one-shot (Adv.) | Specialist IMO: 48% | One-shot/rollout=2 | (Kung et al., 2 Jun 2026) |
| MiniF2F | 99.2% (pass@4) | Goedel-Prover: 74.6% | Pass@4 | (Varambally et al., 26 Sep 2025) |
| ProverBench | 58.2% (32B OProver) | Goedel-V2-32B: 51.0% | Pass@32 | (Ma et al., 17 May 2026) |
| FormalQualBench | 10/23 | OpenGauss: 8/23 | End-to-end closure | (Li et al., 26 May 2026) |
Putnam 2025 and Lean-IMO-Bench are used for research-level evaluation, while miniF2F and ProverBench provide large-scale validation for Olympiad and general mathematics (Kung et al., 2 Jun 2026, Liu et al., 20 Jan 2026, Varambally et al., 26 Sep 2025, Li et al., 26 May 2026).
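Several rows of the table report pass@k. For readers unfamiliar with the metric, the sketch below uses the widely used unbiased estimator, which gives the probability that at least one of k samples drawn from n attempts (c of them correct) is correct. Whether each cited paper uses this estimator or simply runs exactly k attempts per problem is not stated here, so treat this as background rather than a description of their evaluation protocols.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts with c successes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:  # every size-k subset must contain a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


print(pass_at_k(4, 1, 1))
print(pass_at_k(4, 1, 4))
print(benchmark_pass_at_k([(4, 1), (4, 0)], k=4))
```

With four attempts and one success, pass@1 is 0.25 while pass@4 is 1.0, which is why solve rates reported at different k are not directly comparable.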
Strengths demonstrated by LEAP systems include:
- Superior solve rates, frequently outperforming single-model baselines (e.g., Gemini-3.1-Pro, Goedel-Prover-V2).
- Efficiency in both proof length (Lean lines of code) and wall-clock performance.
- Robustness to domain shift, including success on tasks in quantum algebra and research-level combinatorics (Varambally et al., 26 Sep 2025, Tredici et al., 14 Oct 2025).
6. Applications, Generalization, and Extensions
The agentic paradigm instantiated by LEAP generalizes to multiple proof assistant platforms and task modalities:
- Research-level mathematics: Large-scale formalizations (e.g., Brascamp–Lieb theorem, Hamiltonian decompositions) with human–AI collaboration and agentic self-correction (Liu et al., 20 Jan 2026, Kung et al., 2 Jun 2026).
- Scientific domains: Application to quantum theorem proving, cryptography formalization, and compliance verification in financial systems (Tredici et al., 14 Oct 2025, Rashie et al., 1 Apr 2026).
- Cross-assistant and multi-agent deployments: Blueprint decomposition and agentic loops ported to Coq, Isabelle/Isar with minor kernel adaptation (Kung et al., 2 Jun 2026, Ma et al., 17 May 2026).
- Cost-aware and scalable reasoning: Routing agents and meta-controllers for dynamic resource allocation, model switching, and self-debugging.
Ongoing and proposed future directions include hybrid model architectures (foundation models for planning, fine-tuned provers for local proof obligations), reinforcement learning for branch prioritization, synthetic-geometry tool integration, and distributed agent hierarchies (Kung et al., 2 Jun 2026, Rögnvaldsson et al., 3 Jun 2026, Li et al., 26 May 2026).
7. Limitations and Open Challenges
Despite dramatic gains, several challenges persist:
- Proof elegance and abstraction: LEAP-generated proofs remain verbose, low-level, and sometimes lack the abstraction or idiomatic structure typical of expert human formalization (Liu et al., 20 Jan 2026).
- Type-level and domain-specific errors: Type mismatches (e.g., `Real` vs `NNReal`) or nuanced conversion issues often require pre-processing or manual human guidance.
- Long-horizon scalability: Deeply nested theorems can lead to intractable AND-OR search spaces, with branching-factor blowup unless aggressively pruned or prioritized.
- Automated incremental memory: Most implementations lack a persistent cache or memory of past proofs, which limits reuse and confines any speedup to local search (Tredici et al., 14 Oct 2025, Li et al., 26 May 2026).
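To indicate what the missing persistent memory might look like in its simplest form, the sketch below stores proof scripts on disk keyed by a hash of the whitespace-normalized statement. `ProofCache` and `goal_key` are hypothetical, the normalization is deliberately naive, and a cached script would still need to be rechecked by the Lean kernel before reuse.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


def goal_key(statement: str) -> str:
    """Hash a statement after collapsing runs of whitespace."""
    normalized = " ".join(statement.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ProofCache:
    """JSON-backed map from statement hash to a previously found proof script."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self._entries: Dict[str, str] = (
            json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
        )

    def get(self, statement: str) -> Optional[str]:
        return self._entries.get(goal_key(statement))

    def put(self, statement: str, proof: str) -> None:
        self._entries[goal_key(statement)] = proof
        self.path.write_text(json.dumps(self._entries), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "proofs.json"
    ProofCache(cache_file).put("theorem  foo : 1 + 1 = 2", "by rfl")
    reloaded = ProofCache(cache_file)  # simulates a later session
    hit = reloaded.get("theorem foo : 1 + 1 = 2")
    miss = reloaded.get("theorem bar : 2 + 2 = 4")

print(hit, miss)
```

Exact-match hashing only helps when the same statement recurs; reuse across similar but non-identical goals would need the semantic retrieval discussed in Section 4.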
A plausible implication is that further development of LLM-in-Lean agentic provers will focus on increasing proof abstraction, introducing domain heuristics, and scaling harnesses for distributed, parallel agent deployment.
References:
- "Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics" (Liu et al., 20 Jan 2026)
- "LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks" (Kung et al., 2 Jun 2026)
- "Hilbert: Recursively Building Formal Proofs with Informal Reasoning" (Varambally et al., 26 Sep 2025)
- "Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean" (Rögnvaldsson et al., 3 Jun 2026)
- "OProver: A Unified Framework for Agentic Formal Theorem Proving" (Ma et al., 17 May 2026)
- "MerLean-Prover: A Recursive Looping Harness for End-to-End Lean 4 Theorem Proving" (Li et al., 26 May 2026)
- "Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics" (Tredici et al., 14 Oct 2025)