LeanBridge: Retrieval-Augmented Proof Synthesis
- LeanBridge is a retrieval-augmented proof synthesis framework that uses a retrieve–generate–verify loop with compiler feedback for formalizing proofs in Lean.
- It combines semantic retrieval from the Mathlib library with a single LLM that generates Lean 4 proofs, which the Lean compiler then verifies; reported pass rates reach up to 12% on formal tasks.
- The framework outperforms pure LLM prompting methods but still faces challenges in abstract reasoning and multi-step formalization in advanced category theory.
LeanBridge is a retrieval-augmented proof synthesis framework developed and evaluated in the context of formalized research mathematics, specifically targeting the Lean theorem prover. Designed for research-grade automation of abstract mathematics, LeanBridge integrates LLMs with semantic retrieval from the Mathlib library and a proof verification loop in Lean 4. It is benchmarked on the LeanCat-1 dataset, which consists of 100 fully formalized 1-category-theoretic tasks spanning multiple difficulty tiers. LeanBridge was introduced to advance beyond pure LLM prompting by explicitly leveraging library-mediated reasoning and verification cycles (Xu et al., 31 Dec 2025).
1. System Architecture and Core Workflow
LeanBridge implements a tight retrieve–generate–verify loop around a single LLM, with LeanExplore providing semantic search over Mathlib. The process is as follows:
- Retrieval: Given a problem statement—either in natural language or as a formal Lean declaration—LeanBridge issues it as a query to LeanExplore. The system returns a ranked list of relevant Mathlib objects (symbols, definitions, lemmas), which are ingested into the proof context.
- Generation: The LLM receives an enriched context comprising the standard Lean imports, universe annotations, the original problem, the retrieved snippets, and verification feedback from previous attempts. It is then prompted to generate a Lean 4 proof script for the statement.
- Verification: The candidate Lean script is submitted to the Lean 4 compiler (using Mathlib 4.19.0). Success is declared if the script type-checks, contains no `sorry` or `admit` placeholders, and closes the theorem as intended. Compiler errors and feedback are parsed and incorporated into the next retrieval–generation cycle.
Each theorem is allotted up to four such cycles for both the statement-generation and proof-generation phases. An optional human-in-the-loop check filters out syntactically correct proofs that do not align semantically with the intended result (Xu et al., 31 Dec 2025).
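The success criterion in the verification step can be illustrated with a minimal Python sketch. It assumes a hypothetical `CompileResult` record holding the compiler's exit code and output; neither that type nor the function names come from the source, and the comment stripping is a simplification that ignores nested block comments.

```python
import re
from dataclasses import dataclass


@dataclass
class CompileResult:
    """Hypothetical record of one compiler run (not a LeanBridge interface)."""
    exit_code: int  # 0 means the Lean compiler reported no errors
    messages: str   # raw compiler output, reusable as feedback on failure


# Matches the placeholders `sorry` and `admit` as whole words.
PLACEHOLDER = re.compile(r"\b(sorry|admit)\b")


def strip_comments(source: str) -> str:
    """Remove Lean block comments (/- ... -/) and line comments (-- ...).

    Simplification: nested block comments are not handled.
    """
    source = re.sub(r"/-.*?-/", "", source, flags=re.DOTALL)
    return re.sub(r"--.*", "", source)


def is_accepted(source: str, result: CompileResult) -> bool:
    """Accept only scripts that compile and contain no placeholders."""
    if result.exit_code != 0:
        return False
    return PLACEHOLDER.search(strip_comments(source)) is None
```

A textual scan like this is only a first filter. Lean also emits a warning when a declaration uses `sorry`, so a production check would more likely inspect the compiler output, and the optional human review described above addresses semantic misalignment that no syntactic check can catch.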
2. Algorithmic Pipeline
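The operational pipeline is the loop from Section 1: each theorem passes through up to four cycles of retrieval, generation, and compilation, with parsed compiler feedback carried into the next cycle.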
No additional formal definitions of intermediate representations are specified; focus is placed on the retrieve–generate–verify abstraction (Xu et al., 31 Dec 2025).
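The retrieve–generate–verify loop can be sketched in Python as follows. This is an illustrative reconstruction from the prose description, not the authors' pseudo-code: the three callables (`retrieve`, `generate`, `verify`) are hypothetical stand-ins for LeanExplore, the LLM, and the Lean compiler, and appending feedback to the retrieval query is an assumption about how feedback enters the next cycle.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; the source does not specify any of them.
Retriever = Callable[[str], List[str]]            # query -> retrieved snippets
Generator = Callable[[str, List[str], str], str]  # (problem, snippets, feedback) -> script
Verifier = Callable[[str], Tuple[bool, str]]      # script -> (accepted, feedback)


def retrieve_generate_verify(
    problem: str,
    retrieve: Retriever,
    generate: Generator,
    verify: Verifier,
    max_cycles: int = 4,  # the source allots up to four cycles per theorem
) -> Optional[str]:
    """Return the first accepted Lean script, or None if every cycle fails."""
    feedback = ""
    for _ in range(max_cycles):
        # Assumption: compiler feedback is appended to the retrieval query.
        query = problem if not feedback else f"{problem}\n{feedback}"
        snippets = retrieve(query)
        script = generate(problem, snippets, feedback)
        accepted, feedback = verify(script)
        if accepted:
            return script
    return None
```

Passing the components in as callables keeps the sketch independent of any particular model or retrieval backend, which matches the single-LLM, model-agnostic setup the article describes.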
3. Evaluation Metrics and Empirical Results
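Evaluation employs the standard pass@k protocol (Chen et al. 2021), where, for $N$ problems, $\text{pass@}k$ is the fraction of problems solved in up to $k$ independent attempts:

$$\text{pass@}k = \frac{1}{N} \sum_{i=1}^{N} \max_{1 \le j \le k} s_{i,j},$$

where $s_{i,j} \in \{0, 1\}$ denotes whether attempt $j$ on problem $i$ type-checks and closes the theorem.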
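As a small worked example of this definition, the following Python sketch counts a problem as solved if any of its first k attempts succeeds. The outcome matrix is made-up illustrative data, not results from the paper, and the function computes the direct fraction described above rather than the unbiased estimator that Chen et al. (2021) use when more than k samples are drawn.

```python
from typing import Sequence


def pass_at_k(outcomes: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of problems with at least one success among the first k attempts."""
    if not outcomes:
        raise ValueError("need at least one problem")
    solved = sum(1 for attempts in outcomes if any(attempts[:k]))
    return solved / len(outcomes)


# Illustrative data only: 4 problems, 4 attempts each.
outcomes = [
    [False, True, False, False],
    [False, False, False, False],
    [True, True, False, True],
    [False, False, False, True],
]
print(pass_at_k(outcomes, 1))  # 0.25
print(pass_at_k(outcomes, 4))  # 0.75
```

Only the third problem succeeds on its first attempt, so pass@1 is 1/4, while three of the four problems succeed somewhere within four attempts, giving pass@4 of 3/4.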
Results for the Claude Opus 4.5 single-run baseline, partitioned by LeanCat-1 difficulty tier:
| Tier | pass@1 (%) | pass@4 (%) |
|---|---|---|
| Easy | 32.5 | 50.0 |
| Medium | 4.17 | 4.76 |
| High | 0.0 | 0.0 |
LeanBridge consistently improves over single-model baselines (e.g., Claude Opus 4.5), but precise numerical gains are not tabulated—results are described only as “consistent gains.” No significance tests or confidence intervals are provided (Xu et al., 31 Dec 2025).
4. Comparative Analysis
The primary comparison is against “pure” LLM prompting without retrieval or feedback loops. The single-model baseline, Claude Opus 4.5, achieves the pass rates tabulated in Section 3, from 50.0% pass@4 on the Easy tier down to 0% on the High tier. Introducing LeanBridge yields additional solved theorems across all tested LLMs (Claude Opus 4.5, GPT-5.2, Gemini 3 Pro, DeepSeek, Kimi K2). However, the paper does not state exact absolute or relative improvements; the gains are reported qualitatively as “consistent” across all models. The absence of quantitative breakdowns for LeanBridge reflects the LeanCat paper’s focus on the benchmark itself (Xu et al., 31 Dec 2025).
5. Key Implementation Decisions
Several considerations are critical to LeanBridge’s design:
- Semantic Retrieval: LeanExplore is used to search the complete Mathlib 4.19.0 index for relevant declarations and lemmas, informing both the prompt context and potential proof strategies.
- Feedback Loop: Up to four cycles of retrieval, generation, and verification are allowed per task instance.
- Human-in-the-Loop Verification: Proofs that pass Lean’s kernel are optionally filtered for semantic fidelity to exclude trivial or misaligned “green” proofs.
- Context Pipelining: Each LLM prompt includes the full Lean file header, universe-level metadata, all relevant imports/retrieved declarations, and accumulated compiler feedback.
- Model Agnosticism: Experiments are run separately with multiple leading LLMs, but no ensemble techniques are employed; each run uses a single LLM instance. (Xu et al., 31 Dec 2025)
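The context-pipelining decision can be illustrated with a short Python sketch that assembles one prompt from the four ingredients listed above. The section titles, the `build_prompt` name, and the layout are assumptions made for illustration; the source does not specify a prompt format.

```python
from typing import List


def build_prompt(
    header: str,          # Lean file header: imports and universe declarations
    problem: str,         # the statement to prove
    snippets: List[str],  # retrieved Mathlib declarations
    feedback: List[str],  # compiler messages from earlier attempts
) -> str:
    """Join the non-empty context sections into a single prompt string."""
    sections = [
        ("Lean file header", header),
        ("Problem", problem),
        ("Retrieved Mathlib declarations", "\n".join(snippets)),
        ("Compiler feedback from earlier attempts", "\n".join(feedback)),
    ]
    # Skip empty sections so a first attempt carries no feedback block.
    parts = [f"### {title}\n{body}" for title, body in sections if body.strip()]
    return "\n\n".join(parts)


prompt = build_prompt(
    header="import Mathlib\nuniverse u v",
    problem="theorem t : True",          # placeholder statement
    snippets=["theorem trivial : True"],  # placeholder retrieval result
    feedback=[],
)
print(prompt)
```

Keeping feedback as an accumulated list means each retry sees every earlier compiler message, at the cost of a prompt that grows with each cycle.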
6. Limitations, Insights, and Future Directions
LeanBridge demonstrates that retrieval augmentation (LeanExplore) combined with compiler-feedback loops yields higher research-level formalization rates than pure LLM prompting, establishing the value of library grounding. However, significant challenges remain:
- Library Awareness: Retrieval of relevant lemmas remains unreliable, constraining proof synthesis on abstract or rarely indexed theorems.
- Abstraction Control: LLMs have difficulty maintaining diagrammatic or universal-property reasoning essential for high-level category theory, instead regressing to elementwise arguments.
- Long-Horizon Planning: LeanBridge struggles to connect multi-step proof strategies, hampering performance on “Medium” and “High” difficulty tasks.
The tabulated baseline solved no High-tier theorems and fewer than 5% of Medium-tier theorems, and no tier-level results are reported for LeanBridge. This suggests that even with integrated tool support, current LLM+retrieval architectures are insufficient for true research-level formalization in Lean, particularly in category-theoretic domains.
The authors propose future work in richer retrieval paradigms, hierarchical decomposition, stronger abstraction-aware planner integration, and potential multi-agent or multi-context extension as LeanCat evolves towards higher-complexity benchmarks (e.g., monoidal categories, bicategories) (Xu et al., 31 Dec 2025).
7. Context within Verified Synthesis and Related Systems
LeanBridge exemplifies the trend toward “retrieval-augmented theorem proving,” comparable in broad aims to structured prompting approaches such as BRIDGE (which decomposes synthesis into Code, Specifications, and Proofs domains, with prompt-scaffolded transitions) (George et al., 26 Nov 2025). Unlike BRIDGE, which emphasizes semantic chain-of-thought and multiple intermediate representations, LeanBridge’s innovation is the integration of semantic retrieval (via LeanExplore) and iteration within a type-theoretic framework.
A plausible implication is that combining structured domain-guided reasoning (as in BRIDGE) with aggressive retrieval loops (as in LeanBridge) may be required to achieve substantially higher rates of verified proof generation in interactive theorem proving environments. However, the LeanCat/LeanBridge evaluation underscores that progress remains incremental: the hardest research-level benchmarks resist current retrieve–generate–verify pipelines (Xu et al., 31 Dec 2025, George et al., 26 Nov 2025).