TheoremGraph: Bridging Formal and Informal Mathematics

Published 24 Jun 2026 in cs.IR, cs.AI, and math.HO | (2606.25363v1)

Abstract: Mathematical knowledge is organized around statements and their dependencies, but this structure is exposed unevenly: informal papers cite mostly at the document level, while formal libraries record fine-grained dependencies over a much smaller body of mathematics. We introduce TheoremGraph, a unified statement-level dependency graph spanning both informal and formal mathematics. On the informal side, we parse 11.7M theorem-like environments from mathematics arXiv and recover 18.3M candidate directed dependencies, each labeled by the extractor that proposed it so downstream users can trade coverage for precision. On the formal side, we release LeanGraph, a Lean 4 elaborator-level extractor producing 388,105 declaration nodes and 11.3M typed edges across 25 Lean projects. We bridge the two graphs by embedding generated natural-language slogans into a shared semantic space, linking related statements across papers and across the informal/formal divide; an LLM judge affirms 47,952 such matches above a 0.8 cosine floor, with the judge-acceptance rate rising from 48% across the floor to 87% in the >=0.9 tier. On formal concept retrieval, our name-and-signature representation with graph expansion comes within 0.5pp of LeanSearch v2's reranked Recall@10 (0.775 vs. 0.780) without an LM reranker. We release the dataset, extractors, HTTP API, and MCP interface as infrastructure for mathematical search, attribution, and retrieval-augmented reasoning, available at theoremsearch.com and huggingface.co/datasets/uw-math-ai/theorem-matching.

Summary

  • The paper builds a fine-grained dependency graph linking 11.7M informal and 388K formal statements to bridge informal literature with precise formal proof systems.
  • It introduces LeanGraph, an elaborator-level extractor for Lean 4 that yields 388,105 declaration nodes and 11.3M typed edges across 25 Lean projects; on the informal side, deterministic edges reach 98.8% judged precision.
  • The paper employs LLM-based sloganization and vector embeddings to semantically align mathematical statements, enhancing autoformalization and retrieval tasks.

Motivation and Contributions

TheoremGraph addresses the longstanding fragmentation between informal mathematical literature (vast, loosely structured, and cited mostly at the document level) and the formalized mathematics ecosystem, which is precise and explicit but much narrower in coverage. The paper introduces a unified, statement-level dependency graph that combines 11.7M theorem-like statements from mathematics arXiv with 388K Lean 4 formal declarations, connected by 18.3M candidate informal dependencies and 11.3M typed formal edges. Unlike prior document-level or coarser-grained bibliometric analyses, it maps dependencies explicitly at the statement level.

Key contributions include:

  • Construction of a massive, fine-grained statement-level dependency graph for both informal and formal mathematics, with each dependency labeled by its extraction method (deterministic, heuristic, notation-based), allowing downstream users to modulate the coverage/precision tradeoff.
  • Release of LeanGraph, an elaborator-level dependency extractor for Lean 4, producing a typed dependency graph across 25 Lean projects.
  • Bridging of informal and formal mathematics via shared, LLM-generated natural-language “slogans,” embedded into a common high-dimensional vector space.
  • Systematic evaluation showing high precision (98.8%) for deterministic informal edges, and 47,952 cross-formality statement pairs affirmed by a strict LLM judge, with an 87% judge-acceptance rate in the highest similarity band (cosine ≥0.9).
  • Open-sourcing of all infrastructure, dataset artifacts, and APIs for broad adoption by the mathematical and AI/ML communities.

Methods: Extraction and Graph Construction

Informal Dependency Graph

Informal theorem statements are parsed from mathematics-tagged arXiv papers using a robust LaTeX-based pipeline. The extractor utilizes three edge-generation strategies:

  • Deterministic extraction leverages parseable, explicit references via LaTeX \ref and \cite commands, resolving within-paper and cross-paper links down to the statement level wherever possible.
  • Heuristic extraction identifies proximity-based references and backward discourse signals (“Theorem 3.2 above”), constrained to preserve causal document order.
  • Notation-based extraction employs LLMs to track the introduction and usage of mathematical symbols, constructing edges from definitions to subsequent uses within the same document stream.

Each proposed edge is labeled by its source extractor. Evaluation against LLM-judged labels on 500 papers shows high precision for deterministic edges (98.8%); the heuristic and notation extractors are less precise but add complementary coverage.
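As a rough illustration of the deterministic strategy, the sketch below finds theorem-like environments with a regular expression and turns explicit references between labeled statements into tagged edges. It is a minimal sketch, not the paper's pipeline: the environment list, the function names, and the restriction to references inside statement bodies (rather than proofs or cross-paper citations) are all assumptions made for illustration.

```python
import re

# Hypothetical list of environment names; a real parser would need many more.
ENV_NAMES = ("theorem", "lemma", "proposition", "corollary", "definition")

ENV_RE = re.compile(
    r"\\begin\{(%s)\}(.*?)\\end\{\1\}" % "|".join(ENV_NAMES),
    re.DOTALL,
)
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
REF_RE = re.compile(r"\\(?:ref|eqref|cref|Cref)\{([^}]*)\}")


def extract_statements(tex):
    """Return theorem-like environments in document order."""
    statements = []
    for match in ENV_RE.finditer(tex):
        body = match.group(2)
        label = LABEL_RE.search(body)
        statements.append({
            "kind": match.group(1),
            "label": label.group(1) if label else None,
            "body": body,
        })
    return statements


def deterministic_edges(statements):
    """Emit (source, target, extractor) for explicit references between labeled statements."""
    known = {s["label"] for s in statements if s["label"]}
    edges = []
    for s in statements:
        if s["label"] is None:
            continue
        for target in REF_RE.findall(s["body"]):
            # Keep only references that resolve to another known statement.
            if target in known and target != s["label"]:
                edges.append((s["label"], target, "deterministic"))
    return edges
```

Because every edge carries its extractor name, heuristic or notation-based edges could be appended to the same list and filtered later.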

Formal Dependency Graph

LeanGraph traverses the Lean 4 kernel environment, extracting post-elaboration semantic dependencies:

  • All user-facing declarations are nodes, classified via Lean’s documentation and type APIs to exclude compiler artifacts.
  • Six typed edge categories distinguish structure/class inheritance (extends), structure fields, signature-level constant usage, proof-term dependencies, value-level def bodies, and docstring cross-links.
  • Cross-library dependencies are included, yielding a dense, high-fidelity formal knowledge graph.

This fine distinction allows for nuanced downstream analysis and tool-assisted navigation.
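To make the typed-edge idea concrete, here is a minimal sketch of a graph that stores one label per edge and lets a caller restrict a dependency query to chosen edge types. The six labels follow the categories listed above (extends, field, sig, proof, def, docref); the class name, method names, and node names are hypothetical and do not reflect LeanGraph's actual output format.

```python
from collections import defaultdict
from dataclasses import dataclass

# Edge categories as named in the text.
EDGE_TYPES = frozenset({"extends", "field", "sig", "proof", "def", "docref"})


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: str


class TypedGraph:
    """Toy adjacency-list graph with one type label per edge."""

    def __init__(self):
        self._out = defaultdict(list)

    def add(self, source, target, kind):
        if kind not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {kind}")
        self._out[source].append(Edge(source, target, kind))

    def dependencies(self, source, kinds=None):
        """Targets of edges leaving `source`, optionally restricted to some edge types."""
        allowed = EDGE_TYPES if kinds is None else set(kinds)
        return [e.target for e in self._out.get(source, []) if e.kind in allowed]
```

For example, asking only for "sig" edges would return what a declaration's statement mentions, while "proof" edges would return what its proof term uses.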

Cross-Formality Linking via Semantics

Sloganization and Embedding

Every statement from both the informal and formal corpora is processed through LLM-based sloganization (Qwen3-235B), producing a concise, one-line natural-language summary. These slogans are embedded (Qwen3-Embedding-8B, dimension 4096) into a shared vector space. An approximate nearest-neighbor (ANN) index based on HNSW supports fast lookups, enabling semantic search and alignment across corpora without relying on lexical overlap.
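The lookup step can be pictured with a brute-force stand-in for the ANN index. The sketch below computes cosine similarity over toy vectors and returns the best-scoring neighbors; it assumes embeddings are already available as plain lists of floats, and the function names are illustrative. An HNSW index would approximate the same ranking much faster at the scale described.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query, index, k=1):
    """Exhaustively score every vector in `index` (a dict of id -> vector) and keep the best k."""
    scored = [(cosine(query, vec), key) for key, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```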

Matching and Verification

Cross-formality matches are identified by vector similarity; for each formal node, its top-scoring informal neighbor is proposed as a candidate match. Judging uses a strict, domain-calibrated LLM (GPT-5.4), which labels matches as exact (semantic equivalence), inexact (subsumption, generalization), or wrong (semantically distinct).

Of 100,799 candidates with cosine similarity above 0.8, 47,952 (about 48%) are affirmed as matches (exact or inexact). The acceptance rate rises monotonically with similarity, reaching 87% in the ≥0.9 band. Control experiments confirm that true matches dominate top ranks and that the embedding signal captures semantic, not merely lexical, agreement. Manual calibration on a small expert-checked sample indicates that the judge is conservative.
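The headline rate follows directly from the two counts, and the per-band rates can be computed the same way from judged pairs. The sketch below shows both; the helper name, the toy judged pairs, and the choice of floors are assumptions for illustration, and only the counts 47,952 and 100,799 come from the text.

```python
def acceptance_rate_by_floor(judged, floors=(0.8, 0.9)):
    """For each floor, the share of pairs with cosine >= floor judged exact or inexact.

    `judged` is a list of (cosine, label) pairs, label in {"exact", "inexact", "wrong"}.
    """
    rates = {}
    for floor in floors:
        labels = [label for cos, label in judged if cos >= floor]
        if labels:
            rates[floor] = sum(label != "wrong" for label in labels) / len(labels)
    return rates


# Overall rate above the 0.8 floor, from the counts reported in the text.
overall = 47_952 / 100_799  # roughly 0.476, i.e. about 48%
```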

Downstream Benchmarks and Retrieval Evaluation

The representation underlies enhanced mathematical search and autoformalization:

  • Autoformalization: In a realistic setting (new Mathlib v4.30 theorems, retrieval over v4.29), adding retrieved premises from TheoremGraph improves evaluated correct autoformalization from 5/24 to 8/24, and reduces computational cost relative to brute-force library search.
  • Retrieval benchmarks: On formal concept retrieval, combining the name-and-signature representation with graph-based query expansion reaches a Recall@10 of 0.775, within 0.5pp of the reranked LeanSearch v2 baseline (0.780), without using a reranker. The improvement is primarily due to better handling of structure/class definitions, for which slogans alone are weak retrieval signals.

Empirically, inclusion of explicit dependency context and programmatically accessible graph expansion enhances both concept retrieval and premise discovery tasks.
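For readers unfamiliar with the metric, the sketch below computes Recall@k and shows one simple way graph expansion could change a ranking: each retrieved item pulls in a few of its graph neighbors right behind it. The expansion rule here is a hypothetical stand-in, since the summary does not describe the exact procedure the authors use.

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant items that appear among the first k ranked items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def expand_with_neighbors(ranked, neighbors, per_hit=1):
    """Insert up to `per_hit` unseen graph neighbors directly after each ranked item."""
    seen, expanded = set(), []
    for item in ranked:
        if item not in seen:
            seen.add(item)
            expanded.append(item)
        added = 0
        for nb in neighbors.get(item, []):
            if added >= per_hit:
                break
            if nb not in seen:
                seen.add(nb)
                expanded.append(nb)
                added += 1
    return expanded
```

In this toy setup, a relevant declaration that the embedding ranks poorly can still enter the top k if it is a graph neighbor of something ranked highly.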

Implications and Future Directions

By providing a linking scaffold of judged formal-informal matches, TheoremGraph enables a new layer of retrieval-augmented reasoning. It supports both:

  • Attribution and mathematical discovery: Fine-grained dependency tracking clarifies provenance, mitigates duplication, and supports bibliometric analyses at the theorem level.
  • Autoformalization and agentic theorem proving: Cross-formality matches can seed automatic translation of informal mathematics into formal declarations, supporting both premise retrieval and proof synthesis.

Limitations remain in recall (due to the arXiv-only scope and sloganization artifacts), in cross-formality equivalence (where semantic drift occurs at higher abstraction), and in the extension to premise selection (where single-step embedding is necessary but insufficient for logical dependency chains). The public graph and APIs, however, constitute robust infrastructure for the next generation of LLM-based mathematical reasoning, verification, and search.

Theoretical Impact

The approach situates semantic embeddings as a universal interface, turning the sparse and inconsistent notation of informal mathematics and the verbose, explicit dependencies of formal libraries into a navigable knowledge network. This opens possibilities for:

  • Cross-modal knowledge distillation for foundation model pre-training.
  • Unification of mathematical information retrieval, citation analysis, and formal proof automation methodologies.
  • Transfer learning and alignment evaluation across diverse mathematical subfields and writing traditions.

Conclusion

TheoremGraph operationalizes the long-theorized integration between informal mathematical writing and formal proof systems. By combining large-scale statement-level dependency extraction, LLM-based slogan embedding, and semantically-grounded cross-formality matching, it delivers high-precision infrastructural resources for mathematical search, attribution, and automated reasoning. The dataset, extractors, and APIs are positioned to catalyze future developments in the intersection of formal methods and machine reasoning, and to provide a robust backbone for both scholarly navigation and agentic mathematical discovery.

Explain it Like I'm 14

What is this paper about?

This paper builds a giant “map” of math ideas and how they depend on each other. It connects two worlds:

  • Informal math: the way people write math in papers and textbooks (using words, symbols, and explanations).
  • Formal math: the way computers check math, using precise code-like statements in systems like Lean.

The authors create TheoremGraph, a combined network that links matching ideas across these two worlds so people and AI can find the right results faster, give proper credit, and avoid re‑doing work that already exists.

What questions did the researchers ask?

They set out to answer a few simple questions:

  • Can we automatically find the building blocks of math (definitions, lemmas, theorems) in millions of research papers?
  • Can we figure out which statements rely on which others (their “dependencies”)?
  • Can we do the same for computer-checked math (Lean) and label the different kinds of dependencies clearly?
  • Can we match informal statements in papers to their formal versions in Lean, even if they’re written very differently?
  • Does this help search, learning, and AI tools that try to turn English math into precise, computer-checkable math?

How did they do it?

They followed several steps, using everyday ideas and simple checks along the way.

1) Collecting informal math (papers)

  • They scanned mathematics papers from arXiv and pulled out “theorem-like” parts (things called Theorem, Lemma, Definition, etc.). Think of this like highlighting every important statement in millions of pages.
  • They then tried to find links showing which statements depend on which others. They used:
    • Deterministic rules: clear signals like in-text references and citations.
    • Heuristics: clues from nearby mentions like “by Theorem 3.2” or “as above”.
    • Notation links: spotting when a symbol is defined in one place and used later.
  • Each link is tagged by how it was found so users can choose “very precise” or “broader coverage.”
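Here is a small sketch of what "choosing" could look like in practice: because each link remembers which method found it, a user can keep only the most trustworthy kind or include more. The dictionary layout and function names are made up for this example and are not the released dataset's schema.

```python
from collections import Counter


def filter_edges(edges, allowed=("deterministic",)):
    """Keep only edges proposed by one of the allowed extractors."""
    allowed = set(allowed)
    return [e for e in edges if e["extractor"] in allowed]


def count_by_extractor(edges):
    """How many edges each extractor contributed."""
    return Counter(e["extractor"] for e in edges)
```

Passing only "deterministic" favors precision; adding "heuristic" and "notation" trades some of that precision for broader coverage.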

2) Collecting formal math (Lean)

  • They built LeanGraph, a tool inside the Lean proof assistant, to record exact, fine‑grained links between formal objects (definitions, theorems, etc.).
  • They labeled edges by role, for example:
    • extends (one structure inherits from another),
    • field and sig (type-level info),
    • proof and def (used in proofs or in definitions),
    • docref (mentions in documentation).
  • This gives a clean, detailed map of how formal math pieces fit together.

3) Connecting the two worlds with “slogans” and “embeddings”

  • For every statement (informal or formal), they asked an AI to write a short, clear, one-line summary—a “slogan.”
    • Example: “Every finite subgroup of the multiplicative group of a field is cyclic.”
  • They turned each slogan into a “number fingerprint” (an embedding). You can imagine this as placing each statement as a point in a gigantic 4,096‑dimensional space where similar meanings land close together.
  • Then they looked for nearest neighbors: statements from papers and Lean that mean (nearly) the same thing.

4) Checking matches

  • An AI “referee” (a judging model) reviewed high‑similarity pairs and labeled them as exact match, close match, or wrong. This creates a trusted set of cross-links between informal and formal math.

5) Trying it out

  • They tested if this helps:
    • Finding the right formal statement (search).
    • Helping an AI write a correct Lean statement from an English description (autoformalization).
    • Competing with a strong existing Lean search system, LeanSearch v2, using the same embedding model but different text representations.
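The phrase "different text representations" means changing what text gets handed to the embedding model. A minimal sketch, assuming the representation is simply the declaration name, its type signature, and optionally its slogan joined line by line; the function name and the exact formatting are hypothetical, not the authors' template.

```python
def build_representation(name, signature, slogan=None):
    """Join the available fields into one string to pass to an embedding model."""
    parts = [name, signature]
    if slogan:
        parts.append(slogan)
    # Drop empty fields and surrounding whitespace before joining.
    return "\n".join(p.strip() for p in parts if p and p.strip())
```

For a declaration such as Nat.add_comm, this would place the name and the statement "n + m = m + n" alongside a slogan like "Addition of natural numbers is commutative."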

What did they find, and why does it matter?

Here are the highlights, explained plainly:

  • A huge informal map:
    • About 11.7 million theorem-like statements from math papers.
    • About 18.3 million dependency links between them.
    • Deterministic links were very accurate; heuristic and notation links gave more coverage.
  • A detailed formal map:
    • About 388,000 formal Lean declarations.
    • About 11.3 million typed edges with clear labels for “what depends on what and how.”
  • A bridge between the two:
    • By comparing slogan-embeddings, they found tens of thousands of strong matches between paper statements and Lean declarations.
    • An AI judge confirmed about 48,000 matches (with higher acceptance for the most similar pairs). This shows we can reliably connect human-written math to computer-checked math at scale.
  • Better search without heavy reranking:
    • Using well-chosen text representations for embedding (names + type signatures + slogans) and a small amount of graph expansion, their system reached almost the same Recall@10 as a top Lean search system that also uses a separate reranking model (0.775 vs. 0.780), without needing that extra reranker.
    • Translation: they nearly matched a strong baseline using simpler, faster steps.
  • Helping autoformalization:
    • When an AI tried to write a Lean statement from plain English, giving it the right retrieved context improved the number of correct results from 5/24 to 8/24.
    • Translation: good retrieval makes the AI more accurate while using fewer tokens and tool calls.

Why it matters:

  • Navigating math: Researchers and students can find the exact result they need faster.
  • Credit and clarity: It’s easier to see which results build on which, helping proper attribution and avoiding accidental duplication.
  • AI for math: Stronger retrieval and clear links across formats give theorem-proving AIs the context they need to succeed more often.

What could this change in the future?

  • Easier discovery: Imagine searching across both human-written papers and computer-checked math like you search the web—this brings that closer.
  • Stronger math assistants: AI systems can suggest the right lemmas, definitions, or formal statements to help you complete a proof.
  • Better collaboration: Formal and informal communities can meet in the middle—informal ideas get matched to formal tools, and formal libraries benefit from the vast informal literature.
  • Less duplicated work: With clearer maps of who proved what and how, researchers are less likely to unknowingly re‑prove the same result.

In short, TheoremGraph turns scattered math knowledge into a connected, searchable web—across both regular papers and computer-checked proofs—so people and AI can build on it more effectively. The authors also release their data and tools, so others can use and improve this infrastructure.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, actionable list of what remains missing, uncertain, or unexplored in the paper.

  • Informal-edge semantics are untyped beyond “deterministic/heuristic/notation” labels; there is no taxonomy for informal dependencies (e.g., definitional use, lemma invocation, assumption import), hindering downstream filtering and weighting.
  • The notation extractor has low precision (42.7% judge-verified); robust, supervised methods for definition/use linking (e.g., symbol parsing, operator-tree matching, fine-tuned LMs) and a gold-standard evaluation set are missing.
  • Heuristic within-paper edges suffer from proximity confounds (23.4% false positives); discourse-aware models (rhetorical structure, section headers, cue phrases) and ablations for reducing adjacency-driven false links are not explored.
  • Cross-paper reference resolution works only when the cited theorem is named; mapping generic citation contexts to specific target statements in cited papers remains largely unresolved.
  • The regex-based parser for theorem-like environments may miss macro-defined/custom environments and may mis-parse; there is no measured parsing recall/precision across arXiv styles or a macro-expansion preprocessing step.
  • LLM-judge-only edges (6,372 in the 500-paper sample) indicate recall gaps for extractors; no active-learning loop is proposed to mine these misses and retrain extractors to close recall deficits.
  • The formal graph’s edge correctness/completeness is not quantitatively validated against kernel truth or existing tools (e.g., Jixia, Lean Atlas); cross-tool comparisons and spot checks are absent.
  • Formal-edge importance is not modeled beyond categorical type (extends/field/sig/proof/def/docref); learned weights or priors over edge utility for search and reasoning are not investigated.
  • Inclusion of anonymous instances and tactic objects (tagged but kept) is not ablated; the impact of these on retrieval precision and graph noise is unknown.
  • Minimal dependency sets (proof-minimized premises) are not computed; a pipeline to derive minimal or near-minimal premise sets per theorem is left open.
  • Cross-formality matching is restricted to top-1 formal→informal candidates; many-to-many correspondences (one informal statement formalized by multiple lemmas, or vice versa) and clustering/deduplication strategies are not addressed.
  • Candidates below the cosine 0.8 floor are not explored despite potential true positives; calibrated thresholds, rerankers, and active-learning on lower-similarity bands are not evaluated.
  • Matching judgments rely on slogans with limited context; protocols for contextualized judging (full statements, assumptions, local proof context) and their effect on precision/recall are not tested.
  • Expert calibration is minimal (10 pairs); a larger, multi-expert, publicly released adjudicated set with inter-annotator agreement for exact/inexact/wrong is lacking.
  • The nature of “inexact” matches (generalization, special case, one-way implication) is not annotated; directionality and assumption alignment metadata are absent, limiting safe reuse.
  • A large fraction of Lean declarations have no informal neighbor at ann_k=50 (172,311); alternative embedder families, multi-query retrieval, and additional corpora beyond arXiv are not assessed.
  • Slogan and embedding choices are fixed (Qwen3 generators/embeddings) without ablation or domain tuning; comparisons with other encoders (e.g., text-embedding-3, bge-m3), multi-vector representations, or fine-tuning are missing.
  • Formula structure is not exploited in matching; integrating symbolic formula embeddings (e.g., operator/symbol-layout trees) with text embeddings remains an open avenue.
  • The autoformalization study is small (24 targets) and uses back-translated queries; a larger, diverse benchmark with human-authored informal statements, public splits, and standardized metrics is needed.
  • Only statement synthesis (:= sorry) is tested; effects on proof generation, premise selection, and end-to-end proving performance are not evaluated.
  • The QR-tuned setup degrades MathlibMPR premise retrieval (0.224→0.165 R@10); task-specific architectures or multi-objective training to reconcile concept vs chained-premise retrieval are not explored.
  • No cross-encoder or LM reranker is used; the quality/latency/cost trade-offs of reranking for top-k ordering are untested.
  • Query under-determination (10/24 unsolved across conditions) is observed but not addressed; uncertainty detection, query disambiguation, and interactive clarification strategies are not studied.
  • Informal dependency extraction is validated only against an LLM judge, not human ground truth; building and releasing a human-annotated dependency dataset across subfields is outstanding.
  • Error/bias analysis by subfield, age, venue, or writing style is absent; differential failure modes (e.g., analysis vs algebra) and tailored models remain unexplored.
  • End-to-end evaluations on duplicate detection and precise attribution (motivating use cases) are missing; benchmarks and metrics for these tasks are not provided.
  • Confidence calibration for edges and matches (precision–recall curves, well-calibrated probabilities) is not reported, complicating threshold selection for applications.
  • The informal corpus is arXiv-LaTeX only; extending to journals (PDF-first), books, and other repositories (with tools like GROBID) is unaddressed.
  • Licensing limits public release to 23,399 judged candidates; mechanisms for on-site querying of restricted items or federated evaluation protocols are not proposed.
  • Incremental updates (versioning, change tracking, deprecation) are not described for keeping up with arXiv/Mathlib growth; reproducible snapshots and delta pipelines are open work.
  • Multilingual handling is not discussed; assessing/handling non-English math and cross-lingual sloganization is an open question.
  • Promised “systematic analyses” of graph structure are not detailed; deeper statement-level network studies (centrality, communities, influence) and comparisons to bibliometrics remain to be done.
  • Graph-based propagation of matches (aligning neighborhoods via label propagation from matched nodes) is not attempted; potential to expand cross-formality coverage via neighborhood consistency is unexplored.
  • Slogan faithfulness is not intrinsically evaluated; a human-rated slogan quality set and analysis of downstream error propagation from slogan inaccuracies are needed.
  • Subfield coverage bias (formal coverage strongest in foundations and undergraduate/graduate topics) is acknowledged but unquantified; measuring performance/coverage by area and mitigating gaps is future work.
  • Notation-edge false positives may contaminate graph-based retrieval; per-use-case filters or confidence thresholds tailored to downstream tasks are not provided.
  • Extension to other proof assistants (Coq, Isabelle/HOL, Agda) is not implemented; generalizing LeanGraph and cross-formality matching across PAs is an open technical and alignment challenge.
  • API-level support for higher-level agentic primitives (“find minimal premises,” “disambiguate citation,” “trace attribution”) and their evaluation with agents are not described.
  • Compute/cost/energy budgets for sloganization and embedding at 11.7M scale are not reported; reproducibility and sustainable re-indexing strategies (distillation, caching) are open engineering questions.
  • Index scalability/latency are not benchmarked (QPS, memory, recall–latency trade-offs for pgvector/HNSW and binary projections); operational SLAs for real-time use are unspecified.

Practical Applications

Below is an overview of practical, real-world applications enabled by TheoremGraph’s findings, methods, and released infrastructure. Each item names target sectors, gives concrete tools/workflows that could emerge, and notes assumptions or dependencies that affect feasibility.

Immediate Applications

  • Statement-level mathematical search and discovery
    • Sectors: Academia, Software/Developer Tools
    • Tools/workflows: TheoremSearch website/API/MCP interface for natural-language or LaTeX queries that return matched statements plus fine-grained dependencies; integration into command-line tools and notebooks for quick premise lookups.
    • Assumptions/dependencies: Coverage limited to mathematics arXiv and Lean projects; retrieval quality hinges on sloganization and Qwen3-Embedding-8B; results vary by extractor type (deterministic vs heuristic vs notation).
  • Precise citation recommendations for authors and editors
    • Sectors: Academic Publishing, Academia
    • Tools/workflows: Overleaf/TeX plugin or editor-side assistant that suggests specific theorem/lemma references (within- and cross-paper) instead of coarse paper-level citations; Zotero/Mendeley add-ons that fetch statement-level citations.
    • Assumptions/dependencies: Deterministic links show high precision (≈98.8% judged); heuristic and notation links require human-in-the-loop acceptance; arXiv TeX availability and correct parsing.
  • Prior-art and novelty scanning at submission time
    • Sectors: Academic Publishing, Research Policy
    • Tools/workflows: Editorial dashboards that flag high-similarity statement overlaps (e.g., cosine ≥0.9) to reduce inadvertent duplication; author-facing “preflight” checks before submission.
    • Assumptions/dependencies: arXiv-centric coverage; LLM-judged matches have variable confidence across similarity tiers (e.g., 87% acceptance for ≥0.9); requires human adjudication to avoid false positives.
  • Lean developer productivity and search inside IDEs
    • Sectors: Software/Developer Tools
    • Tools/workflows: VS Code extension powered by the name-and-signature representation with graph expansion to surface relevant declarations quickly; jump-to-dependency and typed-edge filters (sig/extends/field/def/proof/docref); Lean docs integration.
    • Assumptions/dependencies: LeanGraph availability for the target Lean version; continuous index updates; user filtering of instances/tactics if needed.
  • Retrieval-augmented autoformalization assistance
    • Sectors: Academia (formal methods), Software
    • Tools/workflows: Lean-side assistants that fetch relevant premises and signatures to help LLMs produce better “:= sorry” declarations; workflows where agentic provers query TheoremGraph during synthesis.
    • Assumptions/dependencies: Gains demonstrated on a small evaluation (8/24 vs 5/24 evaluated-correct) and depend on LLM quality, prompt design, and recall; formal coverage gaps limit applicability.
  • Course support and learning analytics via dependency maps
    • Sectors: Education
    • Tools/workflows: Interactive concept maps of prerequisite results for syllabi, lecture planning, and study paths; statement-centered handouts linking informal statements to Lean formalizations when available.
    • Assumptions/dependencies: Informal dependencies include heuristic edges with lower precision; instructor review recommended for teaching materials.
  • Notation consistency and hygiene checks for LaTeX manuscripts
    • Sectors: Academic Publishing, Academia
    • Tools/workflows: Linting tool that warns about undefined or conflicting notation and suggests prior definitions; copy-editing assistants that standardize symbols across sections.
    • Assumptions/dependencies: Notation extractor is lower precision (≈42.7% judged); treat outputs as warnings not hard errors; highly dependent on parser robustness.
  • Statement-level bibliometrics and impact dashboards
    • Sectors: Research Policy, Academia
    • Tools/workflows: Analytics portals showing the reuse and dependency centrality of specific statements (beyond paper-level citations), field maps, and influence trajectories.
    • Assumptions/dependencies: Bias toward areas well-represented in arXiv/Lean; need normalization across subfields; per-extractor filtering to control noise.
  • AI agent back-ends for math reasoning and document assistants
    • Sectors: Software/AI, Education
    • Tools/workflows: MCP-enabled math assistants that call TheoremGraph to fetch definitions, related lemmas, and cross-formality matches during reasoning or document drafting.
    • Assumptions/dependencies: API availability/quotas; agent tool-use policies; reliance on LLMs’ instruction-following.
  • Formalization roadmapping and progress tracking
    • Sectors: Academia (formal methods)
    • Tools/workflows: Project managers identify high-value informal statements that already have close Lean matches (or obvious gaps) to prioritize formalization efforts.
    • Assumptions/dependencies: Judged match set is restricted by licensing for redistribution; incomplete formal coverage can hide valuable targets.
  • Integration with scholarly discovery platforms
    • Sectors: Academic Publishing, Libraries/Knowledge Services
    • Tools/workflows: Plug-ins for arXivLabs, zbMATH, Semantic Scholar to surface “Related statements” and “Used by” edges at statement granularity.
    • Assumptions/dependencies: Partnerships and API integration; license compliance for content display.

Long-Term Applications

  • Automated peer-review support at statement level
    • Sectors: Academic Publishing, Research Policy
    • Tools/workflows: Reviewer dashboards that (1) check novelty against high-similarity statements, (2) verify that key dependencies are cited, and (3) propose missing attributions.
    • Assumptions/dependencies: Higher-precision matching and broader corpora beyond arXiv; stronger, multi-judge consensus or human oversight; publisher adoption.
  • Large-scale autoformalization from arXiv to proof assistants
    • Sectors: Academia (formal methods), Software
    • Tools/workflows: End-to-end pipelines where retrieved context, cross-formality matches, and typed edges guide models to produce formal declarations and eventually proofs; continuous integration that turns new arXiv content into formalization candidates.
    • Assumptions/dependencies: Advances in LLM-based formalization and proving; expansion to more areas of math; improved premise selection for chained-premise tasks.
  • Universal multi-assistant formal library integration
    • Sectors: Software, Academia
    • Tools/workflows: A cross-system graph linking Lean, Coq, Isabelle, and HOL statements to shared informal counterparts; cross-translation and reuse of results across assistants.
    • Assumptions/dependencies: Building Coq/Isabelle extractors at LeanGraph’s granularity; reconciliation of foundational differences and naming schemes.
  • Micro-citation and statement-level attribution systems
    • Sectors: Research Policy, Academic Publishing
    • Tools/workflows: DOI-like identifiers for theorems/lemmas; credit assignment and metrics for specific results; fine-grained altmetrics and tenure dossiers.
    • Assumptions/dependencies: Community governance and standards; integration with ORCID/Crossref; universal identifiers and deduplication strategies.
  • Adaptive curricula and intelligent tutoring
    • Sectors: Education, EdTech
    • Tools/workflows: Personalized learning paths built from dependency graphs; automated prerequisite checks; formative feedback with links to formalized versions of learned concepts.
    • Assumptions/dependencies: Pedagogically grounded dependency labeling; vetted content quality; alignment with course objectives and assessment standards.
  • Research portfolio analysis and funding intelligence
    • Sectors: Research Policy, Funding Agencies
    • Tools/workflows: Statement-level maps that reveal foundational gaps, emerging areas, and overlapping efforts; programmatic guidance for targeted calls.
    • Assumptions/dependencies: Coverage bias toward open-access math; careful normalization across subfields and time; continuous data refresh.
  • Interdisciplinary translation and application discovery
    • Sectors: Cross-domain Science & Engineering, Computer Science
    • Tools/workflows: Bridging math statements to downstream CS/physics/engineering concepts by extending the slogan/embedding approach to other corpora; discovery of applicable lemmas for domain problems.
    • Assumptions/dependencies: New extractors for non-math domains; robust cross-domain embeddings; domain expert validation.
  • Enterprise knowledge graphs for math-heavy R&D
    • Sectors: Finance (quant), Cryptography, AI Safety/Verification
    • Tools/workflows: Internal deployments that parse proprietary LaTeX/notes into statement-level graphs; private search and compliance-grade traceability for algorithms/proofs.
    • Assumptions/dependencies: Data privacy and IP constraints; on-prem infrastructure and secure model hosting; custom tuning for domain-specific notation.
  • Compliance and assurance for safety-critical formal verification
    • Sectors: Software/Robotics, Energy, Aerospace
    • Tools/workflows: Linking requirements to formal proofs through typed edges and statement matches; audit trails from high-level properties to verified lemmas.
    • Assumptions/dependencies: Extension to formal methods libraries used in industry; tooling around certification processes; domain-specific ontologies.
  • Library-integrated, real-time authoring assistance
    • Sectors: Academic Publishing, Software/Developer Tools
    • Tools/workflows: Overleaf/IDE copilots that suggest exact theorem references as authors write, auto-generate “Related work (statement-level),” and highlight missing dependencies.
    • Assumptions/dependencies: High-precision, low-latency retrieval; publisher/library partnerships; user acceptance and UI/UX maturity.
  • Generalization to other scholarly domains (beyond mathematics)
    • Sectors: Legal, Life Sciences, Theoretical CS
    • Tools/workflows: Adapt the statement-extraction + slogan + embedding + dependency framework to claims, definitions, and protocol steps in other disciplines.
    • Assumptions/dependencies: Domain-specific pattern mining and evaluation; different notions of “statement” and “dependency”; availability of structured source (e.g., LaTeX or XML).

Glossary

  • Autoformalization: Translating informal mathematical statements into formal declarations that a proof assistant can check. Example: "Neural theorem proving and autoformalization address complementary stages in converting informal mathematics into machine-checked proofs: autoformalization translates informal statements into formal declarations, while theorem proving generates proofs for those declarations."
  • Bibliometrics: The quantitative study of scholarly publications and citations to assess influence and structure. Example: "Bibliometric work uses scholarly signals to measure influence, relatedness, and field structure."
  • Binary-quantized projection: A technique that projects real-valued embeddings to binary vectors to speed up approximate nearest-neighbor search. Example: "a binary-quantized projection is used for fast candidate generation."
  • Calculus of constructions: A higher-order typed lambda calculus foundation used by some proof assistants. Example: "Lean is one of the most prominent modern proof assistants, based on the calculus of constructions and supported by an active open-source ecosystem."
  • CiteRank: A bibliometric ranking algorithm that weights citations by factors like recency and network structure. Example: "while recursive prestige models such as Pinski–Narin influence, Eigenfactor, CiteRank, and Życzkowski's weighted impact factors weight citations by graph structure, recency, or linking behavior"
  • Cohen's kappa: A statistic that measures inter-rater agreement adjusted for chance. Example: "Cohen's $\kappa$ corrects agreement for chance: $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected if the two raters labeled independently at their observed match rates."
  • Cosine similarity: A vector similarity measure based on the angle between embeddings, commonly used in retrieval. Example: "We judge every candidate at cosine similarity $0.8$ and above."
  • docref: A dependency edge type capturing documentation cross-references from code comments (docstrings). Example: "docref captures valid backtick references in docstrings."
  • Eigenfactor: A measure of journal influence that weights citations by the prestige of the citing sources. Example: "while recursive prestige models such as Pinski–Narin influence, Eigenfactor, CiteRank, and Życzkowski's weighted impact factors weight citations by graph structure, recency, or linking behavior"
  • Elaborator-level extractor: A tool that operates after type elaboration to recover precise, type-checked dependencies. Example: "LeanGraph is a Lean 4 elaborator-level extractor that builds typed declaration-dependency graphs from compiled Lean projects."
  • Environment API: The Lean kernel interface exposing the global collection of declarations for inspection. Example: "LeanGraph runs inside Lean 4 over the kernel Environment API, extracting dependencies from elaborated, type-checked declarations rather than source text."
  • Graph expansion: A retrieval technique that augments results by following edges to related nodes in a dependency graph. Example: "Combining the slogan with the name-and-signature representation (Configuration D) and then adding graph expansion (Configuration E) trades ranking for recall: Recall@10 climbs from 0.733 (F) to 0.767 (D) to 0.775 (E), while nDCG@10 slips from 0.558 (F) to 0.550 (D) to 0.548 (E)."
  • h-index: A citation-based metric: a scholar's h-index is the largest h such that h of their papers have at least h citations each. Example: "Citation-count measures such as the h-index and journal impact factor summarize influence directly, while recursive prestige models such as Pinski–Narin influence, Eigenfactor, CiteRank, and Życzkowski's weighted impact factors weight citations by graph structure, recency, or linking behavior"
  • Hit@1: A retrieval metric indicating whether the correct item is ranked first. Example: "The partner ranks first 43.5% of the time and within the top ten 69.9% (Hit@1, Hit@10)."
  • HNSW index: Hierarchical Navigable Small World graph structure for fast approximate nearest-neighbor search. Example: "Vectors are stored in pgvector with an HNSW index"
  • HyDE: A retrieval technique (Hypothetical Document Embeddings) that rewrites queries into pseudo-documents to improve matching. Example: "This fuses the two rankings (HyDE), and phrases the search closer to our slogans."
  • Inductives (Lean): Definitions of inductive types in Lean, which generate constructors and recursion principles. Example: "Structures, classes, definitions, and inductives all benefit, whereas theorem and instance, which typically already have descriptive slogans, decline slightly."
  • Lean blueprint: A LaTeX document that mirrors a formal development and links each informal statement to its formal Lean declaration. Example: "A Lean blueprint~\citep{massot2020leanblueprint} is a LaTeX document that writes out a development's definitions, statements, and proofs in human-readable form, and links each statement to the Lean declaration that formalizes it through author-written \lean{} annotations."
  • Mathlib: The core community-driven mathematical library for Lean. Example: "The Mathlib v4.27–v4.29 graph contains 351K declaration nodes and 9.3M within-library edges;"
  • Matcher declarations: Compiler-generated pattern matching artifacts in Lean. Example: "This removes kernel-generated artifacts such as auxiliary recursors, noConfusion lemmas, matcher declarations, projections, and parent accessors, while retaining user-facing declarations."
  • nDCG@10: Normalized Discounted Cumulative Gain at rank 10; a graded relevance metric emphasizing top-ranked items. Example: "nDCG@10 is a rank-weighted score over the top 10 that rewards ranking the target declaration higher, normalized so a perfect ranking scores 1."
  • Nearest-neighbor lookup: Retrieving items most similar to a query vector in an embedding space. Example: "we map every statement into a shared semantic space using slogans, embeddings, and nearest-neighbor lookup."
  • noConfusion lemma: A Lean-generated lemma capturing injectivity/disjointness properties of constructors for inductive types. Example: "This removes kernel-generated artifacts such as auxiliary recursors, noConfusion lemmas, matcher declarations, projections, and parent accessors, while retaining user-facing declarations."
  • Operator Trees: Tree representations emphasizing operator structure in mathematical formulas for retrieval. Example: "Symbol Layout Trees and Operator Trees with formula-specific indexing methods"
  • pgvector: A PostgreSQL extension for storing and querying dense vector embeddings. Example: "Vectors are stored in pgvector with an HNSW index"
  • Pinski–Narin influence: A bibliometric prestige model that weights citations by the influence of the citing source. Example: "while recursive prestige models such as Pinski–Narin influence, Eigenfactor, CiteRank, and Życzkowski's weighted impact factors weight citations by graph structure, recency, or linking behavior"
  • Post-elaboration constant-reference graph: A dependency graph of constants constructed after the compiler’s elaboration phase. Example: "Jixia, the extractor behind LeanSearch v2 \cite{jixia}, emits a raw post-elaboration constant-reference graph;"
  • Prop (Lean): The universe of propositions in Lean’s type theory used for logical statements. Example: "def captures constants used in non-Prop definition bodies;"
  • Recall@10: The fraction of queries whose correct target appears within the top 10 retrieved results. Example: "Recall@10 is the fraction of queries whose target declaration appears anywhere in the top 10 retrieved results, ranked by similarity."
  • Recursor: An automatically generated eliminator used to define functions by recursion on inductive types. Example: "This removes kernel-generated artifacts such as auxiliary recursors, noConfusion lemmas, matcher declarations, projections, and parent accessors, while retaining user-facing declarations."
  • Retrieval-augmented reasoning: An approach where retrieved external knowledge is incorporated to improve model reasoning or generation. Example: "as infrastructure for mathematical search, attribution, and retrieval-augmented reasoning"
  • Reranker: A secondary model that reorders retrieved candidates to improve ranking quality. Example: "without an LM reranker."
  • Slogan: A concise natural-language summary of a formal or informal statement used for embedding and retrieval. Example: "for each statement, we generate a slogan: a concise, standalone natural-language summary"
  • Symbol Layout Trees: Structural representations that capture the spatial arrangement of symbols in mathematical expressions. Example: "Symbol Layout Trees and Operator Trees with formula-specific indexing methods"
  • Tactic objects: Lean entities representing proof tactics, often treated specially in extraction or filtering. Example: "Anonymous instances and tactic objects are kept but tagged with metadata, allowing downstream consumers to filter them if needed."
  • Trusted kernel: The small, critical core of a proof assistant that verifies proofs for correctness. Example: "These systems allow mathematicians to express definitions, theorems, and proofs in a formal language whose correctness can be verified by a trusted kernel."
  • Życzkowski's weighted impact factors: A bibliometric family of weighted citation metrics proposed by Życzkowski. Example: "while recursive prestige models such as Pinski–Narin influence, Eigenfactor, CiteRank, and Życzkowski's weighted impact factors weight citations by graph structure, recency, or linking behavior"
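
Several glossary entries above define evaluation quantities (Cohen's kappa, Hit@1, Recall@10, nDCG@10) that are easy to state in code. The following is a minimal Python sketch, not the paper's evaluation code. It assumes binary match/no-match labels from two raters and exactly one relevant target per query, which is how the Recall@10 and nDCG@10 entries describe the task; all function names are hypothetical.

```python
import math
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must label the same non-empty set of items")
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at
    # their own observed match rate.
    rate_a = sum(rater_a) / n
    rate_b = sum(rater_b) / n
    p_e = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


def hit_at_k(ranked: Sequence[Hashable], target: Hashable, k: int) -> float:
    """1.0 if the single target appears in the top k results, else 0.0.

    Averaged over queries, this gives Hit@k; with one target per query
    it coincides with Recall@k as defined in the glossary.
    """
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[Hashable], target: Hashable, k: int = 10) -> float:
    """nDCG@k with one relevant item (assumption: binary relevance).

    The ideal ranking puts the target first, so the ideal DCG is 1 and
    the score reduces to 1 / log2(position + 1) for a 1-based position.
    """
    for position, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(position + 1)
    return 0.0


def mean(values: Sequence[float]) -> float:
    """Average a per-query metric over a non-empty set of queries."""
    return sum(values) / len(values)
```

Per-query scores from `hit_at_k` and `ndcg_at_k` would be averaged with `mean` to obtain corpus-level figures. With graded relevance or several relevant targets per query, the ideal DCG is no longer 1 and the nDCG normalization would need to be computed explicitly.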
