LLM-Based Theorem Provers
- LLM-based theorem provers are hybrid systems that integrate large language models with symbolic proof techniques to automate formal reasoning.
- They employ diverse strategies such as whole-proof synthesis, stepwise tactic generation, and modular lemma extraction to achieve state-of-the-art results on benchmarks like miniF2F and ProofNet.
- Their neural-symbolic collaboration accelerates formal verification across platforms like Lean, Isabelle, and Coq while enabling robust error correction and adaptive search.
LLM-based theorem provers are neural or neuro-symbolic systems that employ LLMs to generate, verify, refine, and guide the construction of formal mathematical proofs. These systems have rapidly advanced due to innovations in integration with proof assistants, fine-grained proof analysis, large-scale synthetic data generation, RL training protocols, proof-state tree search, and neuro-symbolic collaboration. They now set new state-of-the-art (SOTA) results on canonical formal reasoning benchmarks such as miniF2F, ProofNet, and LeanWorkbook, and are deployed across Lean, Isabelle, Coq, and domain-specific formal verification pipelines.
1. System Architectures and Key Components
LLM-based theorem provers typically fall into the following architectural paradigms:
- Whole-Proof Generation: The LLM emits an entire proof script, with subsequent validation by a proof assistant. This is used in methods such as MA-LoT and HybridProver (Wang et al., 5 Mar 2025, Hu et al., 21 May 2025).
- Stepwise/Tactic Generation with Tree Search: The LLM proposes the next tactic for a given proof state; search procedures (BFS, beam search, MCTS) are employed to explore proof trees efficiently (Lai et al., 17 May 2025, Xin et al., 8 Sep 2025).
- Modular/Block-based Generation: Proofs are constructed from modular blocks (lemmas or skill libraries), either via online lemma construction (LEGO-Prover (Wang et al., 2023)) or by recursive decomposition and recomposition (ProofAug (Liu et al., 30 Jan 2025)).
- Neuro-Symbolic Hybridization: LLMs are combined with symbolic automation tools. Tactics and ATPs are interleaved, e.g., via maximal compatible semi-proofs in ProofAug or symbolic repair in PALM (Lu et al., 22 Sep 2024).
- Meta-collaboration: Separate LLM modules for proof synthesis and proof correction interact, with error-correcting LLMs (as in MA-LoT (Wang et al., 5 Mar 2025)) leveraging verifier feedback iteratively.
- Offline Lemma Extraction for Symbolic Provers: LLMs are used offline to distill reusable proof strategies/lemmas that augment symbolic automation libraries (Strat2Rocq (Fang et al., 11 Oct 2025)).
- Dataset/Problem Generation Agents: Some systems enhance training or generalization by generating difficult new problems dynamically, using self-play (STP (Dong et al., 31 Jan 2025)) or large-scale synthetic exploration (Lai et al., 17 May 2025).
Many architectures incorporate plug-and-play modules, such as recursive LLM invocation for failing subgoals (ERP in ProofAug), or LLM-driven error analysis and re-planning (ProofCompass (Wischermann et al., 18 Jul 2025)).
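The whole-proof generation paradigm above can be pictured as a simple sample-and-verify loop. The sketch below is illustrative and not any particular system's API: `sample_proof` and `check_proof` are hypothetical stand-ins for an LLM call and a proof-assistant invocation.

```python
# Minimal sketch of whole-proof generation: sample candidate proof
# scripts from an LLM and return the first one the proof assistant
# accepts. `sample_proof` and `check_proof` are hypothetical hooks.
from typing import Callable, Optional

def prove(theorem: str,
          sample_proof: Callable[[str], str],
          check_proof: Callable[[str, str], bool],
          budget: int = 8) -> Optional[str]:
    """Sample up to `budget` candidates; return the first verified one."""
    for _ in range(budget):
        candidate = sample_proof(theorem)
        if check_proof(theorem, candidate):
            return candidate
    return None  # unproved within the query budget

# Toy demo with stubbed components (the "LLM" succeeds on its 3rd try):
def fake_llm(thm):
    fake_llm.calls = getattr(fake_llm, "calls", 0) + 1
    return "correct" if fake_llm.calls >= 3 else "wrong"

result = prove("a + b = b + a", fake_llm, lambda t, p: p == "correct")
```

Stepwise systems replace the inner call with per-tactic proposals plus tree search, trading one large generation for many small, checkable steps.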
2. Fine-grained Proof Analysis and Automation Integration
Recent advances fundamentally improve sample efficiency and proof search robustness by interleaving LLM outputs with formal proof-checker feedback at multiple granularities:
- Proof Structure Analysis (ProofAug (Liu et al., 30 Jan 2025)):
- LLM proposals are parsed into linear sequences of proof steps.
- Each step is executed in the ITP; failing steps or blocks are replaced with "sorry" placeholders.
- The semi-proof is recursively coarsened, replacing minimal incompatible blocks (maximal compatible semi-proof, MCSP).
- Automation tools are called at each "sorry": built-in tactics (auto, simp, blast) and ATPs (Sledgehammer).
- An efficient recursive proving (ERP) module reinvokes the LLM on subgoals, possibly translating tactics into ATP queries before admitting block collapse.
- Hybrid Synthesis and Refinement (HybridProver (Hu et al., 21 May 2025)):
- Whole-proofs are synthesized; sketches are extracted by replacing lower-level tactics with "sorry."
- Tactic-filling LLMs (augmented by tools like Sledgehammer) plug gaps in the sketch until all subgoals are closed.
- Model-Collaborative Correction (MA-LoT (Wang et al., 5 Mar 2025)):
- A Prover LLM constructs the proof via chain-of-thought steps.
- A Corrector LLM repairs unsuccessful proofs using Lean4 error messages and re-verifies.
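The "sorry"-placeholder sketches used by ProofAug and HybridProver can be pictured in a hypothetical Lean 4 fragment (assuming Mathlib; the theorem and lemma names here are illustrative):

```lean
import Mathlib

-- Hypothetical sketch: lower-level steps are replaced by `sorry`
-- holes, to be closed later by automation or a tactic-filling model.
theorem sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := by sorry   -- e.g. closed by `positivity`
  have h2 : 0 ≤ b ^ 2 := by sorry   -- e.g. closed by `positivity`
  exact add_nonneg h1 h2
```

The checker accepts the sketch modulo the holes, so the system can validate high-level structure first and spend its budget only on the subgoals that remain open.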
This multi-level granularity, together with tight integration with proof assistants, enables provers to efficiently leverage the strengths of both neural and symbolic proof search, yielding substantial gains in pass rates and query efficiency.
3. Synthetic Data Generation and Policy Learning
The current generation of LLM provers is critically dependent on large-scale, synthetically generated data covering difficult intermediate proof states and tactic combinations:
- Proof-State Exploration (Lai et al., 17 May 2025):
- A breadth-first traversal, guided by constrained decoding over a curated tactic set, generates 20M unique (state, tactic, premise, next state) tuples.
- Heuristic pruning preserves diversity; post-processing removes duplications and invalid steps.
- The resulting corpus is combined with human-authored data for one-shot fine-tuning: no multi-stage curriculum or expert iteration is used.
- Self-Play and Conjecture Generation (STP (Dong et al., 31 Jan 2025)):
- Dual agents—conjecturer and prover—are trained in an iterative loop, with the conjecturer generating new, challenging but barely provable statements.
- Verification-admissible conjectures become new targets for training, and iterative sampling leverages the most informative proof efforts.
- Role of Theorem Prover as a Judge (TP-as-a-Judge (Leang et al., 18 Feb 2025)):
- Iterative autoformalisation converts LLM-generated proof sketches into Lean code, refining via prover feedback.
- This enables rigorous quality filtering of synthetic data for SFT and RLHF-style training.
The primary result is that one-shot fine-tuning on high-quality synthetic traces ("policy learning") allows relatively compact LLMs (7B parameters) to saturate Pass@1 rates on miniF2F and ProofNet, a level previously thought to require hundreds of millions of human-annotated tactic traces.
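The breadth-first proof-state exploration described above can be sketched as follows. This is a toy rendering, not any system's implementation: `apply_tactic` is a hypothetical stand-in for an ITP step, and real pipelines also record premises and apply constrained decoding to restrict the tactic set per state.

```python
# Sketch of BFS proof-state exploration for synthetic data generation:
# enumerate (state, tactic, next_state) tuples, pruning invalid steps
# and duplicate states to preserve diversity.
from collections import deque

def explore(root_state, tactics, apply_tactic, max_depth=3, max_tuples=10_000):
    """BFS over proof states; returns (state, tactic, next_state) tuples."""
    tuples, seen = [], {root_state}
    queue = deque([(root_state, 0)])
    while queue and len(tuples) < max_tuples:
        state, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for tac in tactics:                 # constrained decoding would
            nxt = apply_tactic(state, tac)  # restrict this set per state
            if nxt is None or nxt in seen:  # invalid step / duplicate state
                continue
            seen.add(nxt)                   # dedup keeps the corpus diverse
            tuples.append((state, tac, nxt))
            queue.append((nxt, depth + 1))
    return tuples

# Toy "ITP": states are ints; "inc" adds 1, "double" doubles (0 -> invalid)
data = explore(0, ["inc", "double"],
               lambda s, t: s + 1 if t == "inc" else (s * 2 or None))
```

Post-processing (deduplication, invalid-step removal) corresponds here to the `seen` set and the `None` filter; real systems apply heavier heuristics.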
4. Search, Tree-guided Reasoning, and Scalability
LLM-based provers exploit advanced search strategies and planning to address the combinatorial explosion of the proof search space:
- Adaptive Beam Search: Beam-size is decayed during tree exploration, balancing exploration with exploitation and maximizing pass rate per compute budget (Lai et al., 17 May 2025).
- Best-First and Multi-Agent Search: BFS-Prover-V2 (Xin et al., 8 Sep 2025) introduces multi-turn off-policy RL for the policy/value model, adaptive tactic-level data filtering, and periodic retraining. At inference it employs a hierarchical Planner–Prover paradigm where a general LLM decomposes the main goal into subgoals, which are then solved in parallel by specialized prover agents with a shared cache.
- Portfolio and Mixed Prompting: ProofAug (Liu et al., 30 Jan 2025) utilizes a portfolio of prompt templates (few-shot, zero-shot, legacy, sketches) for improved proof coverage and solution diversity.
- Divide-and-Conquer Modularity: Modular problem decomposition (LEGO-Prover (Wang et al., 2023)) and explicit subgoal planning (often using a separate LLM) enhance both the depth and breadth of the theorems the system can prove.
These architectural choices underlie new SOTA results on benchmarks: for instance, BFS-Prover-V2 achieves 95.08% on miniF2F and 41.4% on ProofNet (Xin et al., 8 Sep 2025).
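Adaptive beam search with a decayed width can be sketched as below: wide early exploration near the root, narrowing with depth to fit a compute budget. The `propose` scorer is a hypothetical stand-in for an LLM tactic generator returning (log-probability, next-state) pairs.

```python
# Sketch of beam search over tactic proposals with a decaying beam
# width: each level keeps the top-`width` candidates by cumulative
# score, and `width` shrinks geometrically with depth.
def beam_search(root, propose, is_proved, init_beam=8, decay=0.5, max_depth=6):
    """Frontier entries are (score, state); beam width shrinks per level."""
    frontier = [(0.0, root)]
    for depth in range(max_depth):
        width = max(1, int(init_beam * decay ** depth))  # decayed beam size
        candidates = []
        for score, state in frontier:
            for tac_score, nxt in propose(state):        # (log-prob, state)
                candidates.append((score + tac_score, nxt))
        if not candidates:
            return None                                  # search exhausted
        candidates.sort(key=lambda p: p[0], reverse=True)
        frontier = candidates[:width]
        for _, state in frontier:
            if is_proved(state):
                return state
    return None

# Toy demo: states are ints; two "tactics" step +1 (likely) or +2 (unlikely)
found = beam_search(0,
                    lambda s: [(-0.1, s + 1), (-2.0, s + 2)],
                    lambda s: s >= 3)
```

Decaying the width spends most of the budget where branching decisions matter most, which is the exploration/exploitation balance the adaptive variant targets.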
5. Neural-Symbolic Collaboration and Error Correction
Incorporating symbolic reasoning engines and error-aware repair pipelines is central to accuracy and reliability:
- Automated ATPs as Correctors: Frameworks like PALM (Lu et al., 22 Sep 2024) and ProofAug (Liu et al., 30 Jan 2025) call ATPs (Sledgehammer, CoqHammer) to discharge or patch failing subgoals, and iteratively backtrack or repair proof steps deterministically based on error types.
- Activation Steering and Inference-Time Biasing: Activation steering (Kirtania et al., 21 Feb 2025) modifies hidden-state trajectories during tactic prediction to increase the likelihood of more structured or human-like proof search behaviors without any additional fine-tuning.
- Lemma Extraction and Replay: Extracting LLM-discovered lemmas (Strat2Rocq (Fang et al., 11 Oct 2025)) and adding them to the symbolic prover’s library can yield substantial improvements (+13.41% CoqHammer theorem success).
Each approach leverages explainability (proof traces), improved error recovery, and resource-efficient inference, supporting deployment in secure or resource-constrained verification contexts.
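The deterministic, error-type-driven repair pattern used by these neuro-symbolic correctors can be sketched as a dispatch loop. The error categories and the `call_atp` hook below are illustrative, not any tool's actual interface.

```python
# Sketch of error-aware proof repair: classify the checker's error,
# try symbolic automation on the failing step, otherwise backtrack by
# truncating the script at the failure point.
def repair(proof_steps, check, call_atp, max_rounds=5):
    """check(steps) -> (ok, failing_index, error_kind)."""
    for _ in range(max_rounds):
        ok, idx, kind = check(proof_steps)
        if ok:
            return proof_steps
        if kind == "unclosed_goal":          # try symbolic automation first
            patch = call_atp(proof_steps, idx)
            if patch is not None:
                proof_steps = proof_steps[:idx] + [patch] + proof_steps[idx + 1:]
                continue
        # fallback: drop the failing step and everything after (backtrack)
        proof_steps = proof_steps[:idx]
        if not proof_steps:
            return None
    return None

# Toy demo: one failing step that the "ATP" can discharge with `simp`.
def toy_check(steps):
    if "bad" in steps:
        return False, steps.index("bad"), "unclosed_goal"
    return True, -1, None

fixed = repair(["intro h", "bad", "exact h"], toy_check,
               lambda steps, i: "simp")
```

Real systems like PALM distinguish many more error kinds (wrong premise names, type mismatches, timeouts) and pick the repair action per kind.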
6. Empirical Benchmarks and Quantitative Impact
LLM-based theorem provers now dominate established benchmarks for formal reasoning, including miniF2F (Lean/Isabelle), ProofNet, and LeanWorkbook. Selected results:
| Method | miniF2F-test | Query Budget | ProofNet-test |
|---|---|---|---|
| ProofAug+ERP | 56.1% | 500 | — |
| ProofAug (mixed, curated) | 66.0% | 2100 | — |
| BFS-Prover-V2 (+Planner) | 95.08% | up to 2048×2×600 | 41.4% |
| HybridProver | 59.4% | 128 | — |
| STP (Self-Play, whole-proof) | 65.0% | 3200 | 23.9% |
| MA-LoT (Lean4) | 61.07% | 128 | — |
| LEGO-Prover | 50.0% | 100 | — |
Key findings include:
- Sample efficiency: ProofAug achieves +10pp higher pass rates with <1/8th the queries of Subgoal-XL on Isabelle (Liu et al., 30 Jan 2025).
- Modular and portfolio prompting: Increases in coverage and pass rate by combining multiple strategies (Liu et al., 30 Jan 2025).
- Offline lemma mining unlocks previously intractable proofs for symbolic engines (Strat2Rocq (Fang et al., 11 Oct 2025)).
- In formal verification, LLM drafting of human-readable polynomial proofs yields near-100% automation for induction-based bounds with negligible computational overhead (Drechsler, 29 May 2025).
- RL and self-play accelerate the rate of learning and diversity of both problem statements and solutions (Dong et al., 31 Jan 2025).
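The pass rates above are typically pass@k figures, where the query budget bounds k. A standard unbiased estimator (from the Codex evaluation methodology) for one problem with n samples of which c verify is pass@k = 1 - C(n-c, k) / C(n, k):

```python
# Unbiased pass@k estimate for a single problem: n sampled proofs,
# c of which the checker accepts, evaluated at budget k.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # fewer failures than k draws: a hit is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

rate = pass_at_k(10, 1, 1)  # one verified proof among ten samples
```

Benchmark-level numbers average this quantity over all problems, which is why comparisons are only meaningful at matched query budgets.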
7. Limitations, Challenges, and Future Directions
Current systems confront several intrinsic challenges:
- Distribution Shift and Generalization: Out-of-domain generalization remains a bottleneck; lemma-level rewards mainly benefit short proofs or moderately deep reasoning (see (Dong et al., 4 Nov 2024)).
- Sparsity and Data Efficiency: While synthetic data can saturate stepwise tactic models, the generation of highly novel, deep conceptual lemmas or proofs is still limited by dataset diversity and model scale.
- Scalability to New Formalisms: Many frameworks are tightly coupled with particular proof assistant grammars or tactic sets; adaptation to new formalisms (e.g., higher-order logic, category theory) requires substantial engineering.
- Explainability and Error Localization: Mapping formal proof-checker errors to human-understandable guidance is nontrivial; methods such as sub-proposition error feedback (DREAM (Cao et al., 20 Jun 2025)) are early steps.
- Compute and Latency Trade-offs: Hierarchical, multi-agent systems (e.g., BFS-Prover-V2) and retrial loops (MA-LoT) trade pass rate for latency and computational overhead.
Future research points include domain-adaptive training, the synthesis of abstract proof strategies, dynamic axiom discovery, and human-in-the-loop or semi-supervised verification architectures.
LLM-based theorem provers now occupy a central role in the automation of formal reasoning, exemplifying the synergy between neural architectures, symbolic verification, modular library extension, and scalable data-driven approaches. Their rapid advancement is transforming both mathematical formalization and applied verification, while opening a range of open problems at the intersection of machine learning, logic, and formal mathematics.