LeanSearch: Formal Math Retrieval
- LeanSearch is a Lean 4–native semantic search engine for Mathlib that connects informal queries, live proof states, and formal declarations.
- It leverages dense embeddings, fuzzy matching, and LM-based reranking to accurately retrieve over 267,000 theorems and 127,000 definitions in an environment-aware manner.
- The system underpins agentic proving pipelines by integrating iterative sketch–retrieve–reflect cycles to support both local and global premise retrieval in formal proofs.
LeanSearch is a Lean 4–native, environment-aware semantic search engine over Mathlib, and, in later work, a retrieval system for global premise retrieval in Lean 4 theorem proving. In practice it serves as a bridge between informal mathematical ideas, live Lean goal states, and the existing library of theorems and definitions; in a broader practitioner sense, the term also denotes the problem of finding the right lemma, definition, or theorem from an informal query, a partially remembered statement, or a proof state (Ju et al., 4 Apr 2026, Gao et al., 13 May 2026, Lu et al., 8 Oct 2025).
1. Origin, scope, and problem setting
LeanSearch is situated in the broader problem of formal mathematical retrieval: given a Lean goal, an informal mathematical description, or a theorem statement, retrieve existing Mathlib declarations that can be reused directly in a formal proof. In the Archon framework for automated conjecture resolution, LeanSearch is the formal theorem search engine that powers proof synthesis in Lean 4. There it acts as the main retrieval layer between Rethlas, which operates in natural language, and Archon, which formalizes candidate arguments into Lean 4 projects with machine-checkable proofs. Its operational role is to surface relevant Mathlib infrastructure, avoid re-deriving existing results, and help decide whether a formal gap should be closed by library reuse or by new formalization (Ju et al., 4 Apr 2026).
The retrieval problem is nontrivial because Mathlib is large and heterogeneous. A proof obligation is represented by a Lean goal state, including the metavariable target and local context, and retrieval must respect the current environment, namespace resolution, binder structure, typeclasses, and the elaborator’s typing discipline. The same difficulty appears at a larger scale in what LeanSearch v2 later calls global premise retrieval: not merely finding one locally applicable declaration, but recovering the coherent set of library lemmas that jointly support the proof strategy of a single theorem (Gao et al., 13 May 2026).
This broader search problem is also reflected in user-centered systems. Lean Finder characterizes “LeanSearch” as the practical task of finding the right lemma, definition, or theorem in Lean/mathlib from an informal query, a partially remembered statement, or a live proof state, emphasizing the mismatch between real user queries and the neutral informalizations used by earlier search engines (Lu et al., 8 Oct 2025).
2. Core architecture and standard retrieval pipeline
In its Archon-era form, LeanSearch is integrated as a tool via CLI/MCP and indexes the Lean 4 Mathlib environment. The indexed corpus comprises over 267,000 theorems and 127,000 definitions contributed by more than 770 members. Each item is stored with its name and namespace, its full Lean type (including implicit parameters, universes, and binders), and the constants appearing in its statement. Query construction is driven by the current goal’s head symbol, key constants, local hypotheses, and syntactic structure, and retrieval relies on fuzzy matching with compatibility heuristics over type-level and symbol-level information (Ju et al., 4 Apr 2026).
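The query construction and fuzzy matching described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy index, the `score` weights, and the `hint` field are hypothetical, and the source does not publish LeanSearch's actual scoring function.

```python
from difflib import SequenceMatcher

# Toy index: illustrative entries only, not an export of the real LeanSearch index.
INDEX = [
    {"name": "Nat.add_comm", "head": "Eq", "constants": {"HAdd.hAdd", "Nat"}},
    {"name": "Nat.mul_comm", "head": "Eq", "constants": {"HMul.hMul", "Nat"}},
    {"name": "Nat.le_refl", "head": "LE.le", "constants": {"LE.le", "Nat"}},
]

def jaccard(a, b):
    """Overlap between two sets of constants, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def score(goal, decl):
    """Hypothetical compatibility score; the weights are arbitrary."""
    const_overlap = jaccard(goal["constants"], decl["constants"])
    head_match = 1.0 if goal["head"] == decl["head"] else 0.0
    # Fuzzy string similarity between a name hint and the declaration name.
    name_sim = SequenceMatcher(None, goal["hint"], decl["name"].lower()).ratio()
    return 0.5 * const_overlap + 0.3 * head_match + 0.2 * name_sim

def search(goal, index=INDEX, k=2):
    """Return the names of the k best-scoring declarations, best first."""
    ranked = sorted(index, key=lambda d: score(goal, d), reverse=True)
    return [d["name"] for d in ranked[:k]]

# Goal `a + b = b + a` over Nat: head symbol Eq, key constants HAdd.hAdd and Nat.
goal = {"head": "Eq", "constants": {"HAdd.hAdd", "Nat"}, "hint": "add_comm"}
results = search(goal)
```

Combining symbol overlap with a head-symbol check mirrors the idea of matching on both type-level and symbol-level information; a real system would also have to check that a candidate elaborates in the current environment.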
LeanSearch v2 recasts this into a more explicit retrieval architecture. Its standard mode uses a hierarchy-informalized Mathlib corpus extracted by Jixia. The corpus is organized by a declaration-level DAG induced by inter-declaration references, topologically sorted, and informalized bottom-up with dependency summaries as context. Each stored passage can include declaration kind, fully qualified name, type signature, value (for definitions, classes, and instances), source location, dependencies, and a generated natural-language description. Candidate generation uses Qwen3-Embedding-8B with cosine similarity over structured passages, and the top-50 nearest neighbors are reranked by Qwen3-Reranker-8B using a binary relevance token whose score is interpreted as a relevance probability (Gao et al., 13 May 2026).
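The bottom-up informalization order over the declaration DAG can be sketched with Python's standard `graphlib`. The dependency graph and the `informalize` function below are hypothetical stand-ins: in the described system the graph comes from Jixia and the descriptions from a language model.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each declaration maps to the declarations it references.
deps = {
    "Foo.base": set(),
    "Foo.helper": {"Foo.base"},
    "Foo.main": {"Foo.helper", "Foo.base"},
}

def informalize(name, dep_summaries):
    """Stand-in for an LLM call; it only records which summaries were available."""
    context = ", ".join(sorted(dep_summaries)) or "no dependencies"
    return f"{name} (described using: {context})"

def informalize_bottom_up(deps):
    descriptions = {}
    # static_order() yields a declaration only after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        dep_summaries = {d: descriptions[d] for d in deps[name]}
        descriptions[name] = informalize(name, dep_summaries)
    return descriptions

descriptions = informalize_bottom_up(deps)
```

Processing in topological order guarantees that every dependency already has a summary when a declaration is described, which is what makes the dependency summaries usable as context.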
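The published scoring formulas in standard mode are straightforward. Embedding retrieval uses
$$s_{\mathrm{emb}}(q, d) = \cos(\mathbf{e}_q, \mathbf{e}_d) = \frac{\mathbf{e}_q \cdot \mathbf{e}_d}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_d \rVert},$$
and reranking uses
$$s_{\mathrm{rerank}}(q, d) = P(\text{relevant} \mid q, d),$$
where $\mathbf{e}_q$ and $\mathbf{e}_d$ are the embeddings of the query $q$ and a structured passage $d$, and the probability is the one the reranker assigns to its positive relevance token.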
No task-specific fine-tuning is introduced for these models in LeanSearch v2; the system instead emphasizes corpus construction, kind-aware presentation, and reranking over an informalized, dependency-aware representation (Gao et al., 13 May 2026).
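The two-stage scoring in standard mode (cosine similarity for candidate generation, then a binary relevance score for reranking) can be sketched as follows. The sketch assumes the relevance probability is a two-way softmax over the logits of a positive and a negative token, which the text does not spell out. The vectors and logits are toy values rather than model outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relevance_probability(logit_pos, logit_neg):
    """Two-way softmax over a positive and a negative token logit (assumed form)."""
    m = max(logit_pos, logit_neg)  # subtract the max for numerical stability
    e_pos = math.exp(logit_pos - m)
    e_neg = math.exp(logit_neg - m)
    return e_pos / (e_pos + e_neg)

def retrieve_then_rerank(query_vec, passages, rerank_logits, k=50):
    """Keep the k nearest passages by cosine, then reorder them by relevance."""
    candidates = sorted(
        passages, key=lambda n: cosine(query_vec, passages[n]), reverse=True
    )[:k]
    return sorted(
        candidates, key=lambda n: relevance_probability(*rerank_logits[n]), reverse=True
    )

# Toy 2-d embeddings and (positive, negative) logits.
passages = {"decl_a": [1.0, 0.0], "decl_b": [0.8, 0.6], "decl_c": [0.0, 1.0]}
rerank_logits = {"decl_a": (0.0, 1.0), "decl_b": (2.0, 0.0), "decl_c": (5.0, 0.0)}
ranking = retrieve_then_rerank([1.0, 0.0], passages, rerank_logits, k=2)
```

In this toy run `decl_c` has the highest reranker score but never reaches the reranker, because it falls outside the top-k candidate set. The reranker can only reorder what the embedding stage surfaces.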
A central design feature across versions is environment awareness. Returned declarations must be valid in the current Lean environment, and downstream integration relies on the elaborator, simp sets, and typeclass resolution to turn retrieved declarations into compilable tactic scripts or proof terms. This makes LeanSearch closer to a retrieval component inside the proof assistant than to a standalone document search engine (Ju et al., 4 Apr 2026).
3. Global premise retrieval and reasoning mode
LeanSearch v2 distinguishes single-query declaration search from global premise retrieval. The latter is the task of recovering the set of supporting library lemmas required by the overall proof strategy of a single theorem. This differs from per-tactic premise selection, which advances one local proof state at a time, and from semantic search, which retrieves a single declaration matching a query. The motivating example in the paper is a polynomial irreducibility statement whose proof route depends on several scattered Mathlib declarations, including geom_sum_mul, Polynomial.cyclotomic_prime, and Polynomial.cyclotomic.irreducible (Gao et al., 13 May 2026).
To address this, reasoning mode builds on standard mode through iterative sketch–retrieve–reflect cycles. A theorem’s formal statement and informal description are given to a sketch generator, which produces a step-by-step proof outline and a retrieval query for each step. Standard mode is then called on each sub-query; a document filter independently labels returned candidates as relevant or not; and a feasibility judge decides whether the sketch is supportable by the library. If not, a reviser produces a new sketch, with the loop capped at three revisions after the initial sketch. Final ranking is not based on raw cross-step scores, which are not calibrated, but on a rank-discount aggregation in which each declaration $d$ is scored from its $0$-indexed rank $r_i(d)$ in each filtered list $L_i$ that contains it (Gao et al., 13 May 2026).
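A rank-discount aggregation over per-step filtered lists can be sketched as below. The reciprocal discount $1/(1+r)$ and the summation across lists are illustrative assumptions, not necessarily the paper's formula, and the example lists are hypothetical.

```python
from collections import defaultdict

def aggregate(filtered_lists, discount=lambda r: 1.0 / (1 + r)):
    """Score each declaration by summing a discount of its 0-indexed rank per list.

    The default reciprocal discount is an assumption made for illustration.
    """
    scores = defaultdict(float)
    for ranked in filtered_lists:
        for r, decl in enumerate(ranked):  # r is the 0-indexed rank in this list
            scores[decl] += discount(r)
    # Highest score first; ties broken by name for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical filtered lists for two sketch steps.
filtered_lists = [
    ["geom_sum_mul", "Polynomial.cyclotomic_prime"],
    ["Polynomial.cyclotomic_prime", "Polynomial.cyclotomic.irreducible"],
]
final_ranking = aggregate(filtered_lists)
```

Because only ranks enter the score, lists produced by different sub-queries can be merged without comparing their raw scores, and a declaration that appears in several steps accumulates credit.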
This reasoning layer makes retrieval explicitly proof-plan conditioned. Instead of asking only which declaration matches the current text, it asks which declarations jointly instantiate a viable route through Mathlib’s existing interfaces. The reported control flow uses Claude Sonnet 4.5 for sketch generation and feasibility judging, Kimi K2 Instruct for filtering, and standard mode as a retrieval substrate. A plausible implication is that LeanSearch v2 treats retrieval not merely as ranking but as a structured search over possible proof decompositions (Gao et al., 13 May 2026).
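The control flow of the sketch–retrieve–reflect cycle can be sketched with plain callables standing in for the language models and for standard mode. All function names and the toy library are hypothetical. What happens once the revision budget is exhausted is not stated in the text, so the sketch simply returns the last attempt.

```python
def reasoning_mode(statement, sketch_fn, retrieve_fn, keep_fn, feasible_fn,
                   revise_fn, max_revisions=3):
    """Run one initial sketch plus at most `max_revisions` revised sketches."""
    sketch = sketch_fn(statement)
    for revisions_used in range(max_revisions + 1):
        # Retrieve per step, then filter each candidate independently.
        filtered_lists = [
            [c for c in retrieve_fn(step["query"]) if keep_fn(step, c)]
            for step in sketch
        ]
        # Stop when the judge accepts, or when the revision budget is spent (assumed).
        if feasible_fn(sketch, filtered_lists) or revisions_used == max_revisions:
            return sketch, filtered_lists, revisions_used
        sketch = revise_fn(statement, sketch, filtered_lists)

# Toy stand-ins: the first sketch finds nothing, the revised one finds a lemma.
library = {"sum of powers": ["geom_sum_mul"], "made-up step": []}
result = reasoning_mode(
    "toy statement",
    sketch_fn=lambda s: [{"query": "made-up step"}],
    retrieve_fn=lambda q: library.get(q, []),
    keep_fn=lambda step, c: True,
    feasible_fn=lambda sketch, lists: all(lists),  # every step needs support
    revise_fn=lambda s, sketch, lists: [{"query": "sum of powers"}],
)
```

Here `result` holds the accepted sketch, its filtered candidate lists, and the number of revisions used; the filtered lists are what a rank-based aggregation would then merge.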
4. Use in agentic proving and formalization systems
LeanSearch’s importance is most visible in agentic formalization pipelines. In Archon, the Lean Agent attempts proofs obligation by obligation; when blocked, it calls LeanSearch to find definitions and theorems whose type signatures, head symbols, and binder shapes align with the current goal. Returned candidates are then translated into tactic scripts such as rewrite, apply, exact, refine, simp, ring, aesop, and norm_num. The paper gives examples involving completion commuting with quotient and height-one prime contraction in a UFD, where LeanSearch supplies the Mathlib facts that turn an informal commutative algebra argument into a formally verified Lean development. At project scale, this retrieval loop supports a 19k-line Lean 4 formalization completed in approximately 80 hours runtime (Ju et al., 4 Apr 2026).
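How a retrieved declaration becomes a tactic step can be shown on a deliberately small example. It assumes retrieval returned `Nat.add_comm`; the Archon examples concern commutative algebra and are far larger, so this is only an illustration of the `exact` and `rw` patterns named above.

```lean
import Mathlib

-- Suppose retrieval returned `Nat.add_comm : ∀ (n m : ℕ), n + m = m + n`.

-- The declaration matches the goal exactly, so `exact` closes it.
example (a b : ℕ) : a + b = b + a := by
  exact Nat.add_comm a b

-- The same declaration used as a rewrite rule before applying a hypothesis.
example (a b c : ℕ) (h : b + a = c) : a + b = c := by
  rw [Nat.add_comm a b]
  exact h
```

In both cases the elaborator checks that the retrieved statement actually unifies with the goal, which is why environment-valid results matter more than textually similar ones.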
LEAP uses LeanSearch differently but within the same general paradigm. Its proof search loop maintains an AND-OR DAG of goals and verified decompositions, and the direct formalization path includes a LeanSearch retrieval hook for mathlib lemma and API discovery. The system emphasizes blueprint-driven decomposition, compiler-verified sketches, and reviewer-guided pruning rather than a published retrieval model, but LeanSearch remains part of the mechanism by which local proof obligations are connected to reusable library material (Kung et al., 2 Jun 2026).
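The AND-OR structure mentioned above can be illustrated with a small solved-status check: a goal (OR node) is solved if it is closed directly or if any one of its decompositions is solved, and a decomposition (AND node) is solved only if all of its subgoals are. This is the standard AND-OR semantics, assumed here rather than taken from LEAP's implementation; the goal names are hypothetical.

```python
def is_solved(goal, decompositions, closed, _memo=None):
    """Return True if `goal` is closed directly or via some fully solved decomposition."""
    memo = {} if _memo is None else _memo
    if goal in memo:
        return memo[goal]
    if goal in closed:
        memo[goal] = True
        return True
    memo[goal] = False  # provisional value guards against accidental cycles
    solved = any(
        all(is_solved(sub, decompositions, closed, memo) for sub in subgoals)
        for subgoals in decompositions.get(goal, [])
    )
    memo[goal] = solved
    return solved

# Each goal maps to a list of alternative decompositions (lists of subgoals).
decompositions = {
    "main": [["lemma_a", "lemma_b"], ["lemma_c"]],
    "lemma_b": [["lemma_d"]],
}
# Goals closed directly, for example by a lemma found through retrieval.
solved_with_d = is_solved("main", decompositions, closed={"lemma_a", "lemma_d"})
solved_without_d = is_solved("main", decompositions, closed={"lemma_a"})
```

The memo table is what makes this a DAG traversal rather than a tree traversal: a subgoal shared by several decompositions is evaluated once.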
A distinct but related development is LeanSearch-PS inside REAL-Prover. LeanSearch-PS is a retrieval and premise-selection component conditioned on the current Lean proof state, trained from (state, applied-theorem) pairs extracted from Mathlib proofs with Jixia. It uses dense embeddings trained from E5-mistral-7b-instruct with Tevatron and LoRA, retrieves the top-$k$ premises per step, and injects them into the prompt for the tactic-generation model. This is not the same system as LeanSearch v2’s global premise retrieval, but it occupies the same design space of Lean-native retrieval under proof-state conditioning. In ablations, REAL-Prover’s retrieval improves ProofNet Test from 22.6% to 23.7% and FATE-M Test from 44.7% to 56.7% (Shen et al., 27 May 2025).
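Injecting retrieved premises into a tactic-generation prompt can be sketched as follows. The prompt template, the `toy_retriever` function, and the value of `k` are hypothetical; REAL-Prover's actual prompt format is not given in the text.

```python
def build_prompt(proof_state, retrieve_fn, k=5):
    """Prepend the top-k retrieved premises to a tactic-generation prompt."""
    premises = retrieve_fn(proof_state)[:k]
    lines = ["Relevant premises:"]
    lines += [f"- {name} : {statement}" for name, statement in premises]
    lines += ["", "Proof state:", proof_state, "", "Next tactic:"]
    return "\n".join(lines)

# Toy retriever returning (name, statement) pairs, best first.
def toy_retriever(state):
    return [
        ("Nat.add_comm", "n + m = m + n"),
        ("Nat.add_assoc", "n + m + k = n + (m + k)"),
        ("Nat.mul_comm", "n * m = m * n"),
    ]

prompt = build_prompt("a b : ℕ\n⊢ a + b = b + a", toy_retriever, k=2)
```

Retrieval is repeated at every step because the proof state changes after each tactic, so the premise list is conditioned on the current state rather than on the theorem as a whole.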
5. Evaluation, benchmarks, and comparative position
LeanSearch v2 reports both search-specific and downstream proving metrics. On MathlibQR, its standard mode achieves nDCG@10 of 0.62 versus 0.53 for the next-best system, Lean Finder; on the fair-subset main table the numbers are 0.623 versus 0.533, with an LLM-as-judge mean rank of 1.63. On MathlibMPR, reasoning mode reaches Recall(group)@10 of 46.1%, compared with 38.0% for the full DIVER pipeline and 9.3% for the best premise-selection baseline, LeanStateSearch; Covered@10 is 30.4%, compared with 24.6% for DIVER and at most 2.9% for premise-selection baselines. In a fixed downstream prover loop, replacing alternative retrievers with LeanSearch v2 yields the highest proof success: on FATE-H, 20% versus 16% for the next-best retriever and 4% without retrieval; on MathlibMPR-Prop, 14% versus 10% and 4%, respectively (Gao et al., 13 May 2026).
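For readers unfamiliar with nDCG@10, a minimal sketch of the metric follows. It assumes the common formulation with linear gain and a $\log_2$ rank discount; the exact variant used for MathlibQR is not specified in the text, and the relevance labels are toy values.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """DCG of the system ranking divided by the DCG of an ideal ranking.

    `judged_relevances` holds the labels of all judged items for the query,
    so the ideal ranking does not depend on what the system retrieved.
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Two relevant declarations exist; the system ranks them 2nd and 3rd.
value = ndcg_at_k([0, 1, 1], [1, 1], k=10)
perfect = ndcg_at_k([1, 1, 0], [1, 1], k=10)
```

In the toy query, placing both relevant declarations one position too low gives roughly 0.69 rather than 1.0, which illustrates why nDCG separates systems that have similar recall but different fine-grained ordering.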
The earlier Archon paper presents a different style of evidence. It does not report retrieval metrics such as Hit@k, MRR, or MAP for LeanSearch itself; instead, effectiveness is demonstrated qualitatively through end-to-end formalization. The central claim is that strong theorem retrieval tools enable the discovery and application of cross-domain mathematical techniques, and that LeanSearch repeatedly allows Archon to determine whether required facts already exist in Mathlib rather than formalizing infrastructure from scratch (Ju et al., 4 Apr 2026).
Comparative work helps locate LeanSearch within the broader retrieval landscape. TheoremGraph reports that its name-and-signature representation with one-hop graph expansion reaches Recall@10 of 0.775 on MathlibQR fair-810, within 0.5 percentage points of LeanSearch v2’s reranked Recall@10 of 0.780, but without an LM reranker. The same study notes that LeanSearch v2’s full system still leads on fine-grained ranking, with nDCG@10 of 0.623, whereas the graph-based alternatives reach 0.548–0.558 without reranking (Kurgan et al., 24 Jun 2026). Lean Finder, which is explicitly trained on synthesized user-intent queries and preference data, reports Recall@1 of 64.2 on informalized statements versus 49.2 for LeanSearch, 54.4 versus 47.1 on synthetic user queries, and a user-study Top-3 rate of 81.6% versus 56.9% for LeanSearch (Lu et al., 8 Oct 2025).
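The Recall@10 comparison can be made concrete with a short sketch of the metric and the gap arithmetic. The `recall_at_k` definition below is the usual one (fraction of relevant items found in the top k) and is assumed to match the benchmark; the toy query is hypothetical, while 0.775 and 0.780 are the values quoted above.

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

# Toy query: one of the two relevant declarations is in the top 2.
toy_recall = recall_at_k(["a", "b", "c"], {"a", "d"}, k=2)

# Gap between the two reported Recall@10 values, in percentage points.
gap_points = round((0.780 - 0.775) * 100, 3)
```

Recall@10 only asks whether relevant declarations land in the top ten, not where, so two systems can be within half a point on recall while differing more visibly on nDCG@10.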
These comparisons indicate that LeanSearch occupies two roles simultaneously: a competitive benchmarked retriever in its own right, and an infrastructural substrate on top of which other systems measure advances in user-intent modeling, graph-aware expansion, or agentic reasoning.
6. Limitations and directions of development
The original Archon paper leaves several methodological details unspecified. It does not publish training procedures or scoring functions for LeanSearch itself, and it explicitly states that it reports no quantitative retrieval metrics. It also notes concrete failure modes: type mismatches between retrieved lemmas and local goals, overly generic lemmas that complicate unification, distribution shift across Mathlib versions, and coverage gaps when Mathlib lacks required infrastructure, as in parts of Krull domain theory (Ju et al., 4 Apr 2026).
LeanSearch v2 makes the retrieval pipeline much more explicit, but its limitations are also more sharply delimited. The evaluation is restricted to Lean 4/Mathlib; downstream experiments use a minimal reflection workflow proving one theorem at a time; and highly compositional goals still fail when sketches do not align with existing interfaces. Reasoning mode also increases inference-time cost because it adds sketch generation, filtering, and judging on top of dense retrieval. The paper identifies better sketch generation, corpus enrichment, joint training, multi-hop retrieval, and tighter integration with tactic synthesis as future directions (Gao et al., 13 May 2026).
Related systems make these limitations more visible by contrast. TheoremGraph shows that fixed embeddings plus name-and-signature representations and one-hop graph expansion can almost match LeanSearch v2’s Recall@10 without an LM reranker, while Lean Finder shows that explicit modeling of mathematicians’ intents yields large gains on real-user-style queries (Kurgan et al., 24 Jun 2026, Lu et al., 8 Oct 2025). This suggests that the current research frontier is no longer only semantic matching of isolated declarations. It increasingly involves representation choice, graph structure, user-intent alignment, and the coupling between retrieval, proof planning, and elaboration.
LeanSearch therefore names both a concrete family of retrieval systems and a continuing research program in formal mathematical search: retrieving the right declaration in the right environment, at the right granularity, and in a form that an automated or interactive prover can immediately exploit.