Leanabell-Prover-V2: Automated Lean Theorem Proving

Updated 14 July 2025
  • Leanabell-Prover-V2 is a large language model for formal theorem proving in Lean 4, integrating real-time verifier feedback with multi-turn chain-of-thought generation.
  • The model employs reinforcement learning to iteratively refine its proofs based on direct feedback from the Lean 4 verifier.
  • Evaluations on benchmarks like MiniF2F-test and ProofNet demonstrate its improved reliability and efficiency in generating verifiable proofs.

Leanabell-Prover-V2 is a 7B-parameter LLM specialized for formal theorem proving in Lean 4, distinguished by its integration of real-time verifier feedback within a multi-turn chain-of-thought (CoT) generation framework. Building on the prior Leanabell-Prover-V1, the V2 release advances performance principally through reinforcement learning (RL) using feedback and scoring from the Lean 4 verifier. This design enables the model to reflexively correct its own reasoning via direct verifier interactions and structured feedback during both training and inference, leading to measurable improvements on formal theorem proving benchmarks (Ji et al., 11 Jul 2025).

1. Model Architecture and Design

Leanabell-Prover-V2 retains the overall scaling strategy of its predecessor while introducing several key innovations:

  • Long Chain-of-Thought Generation: The model is explicitly trained to output extended, step-by-step reasoning sequences (long CoTs). Within these, code is encapsulated using delimiters such as <code> ... </code> (for Lean 4 code) and <interpreter> ... </interpreter> (for Lean verifier output), structuring the dialog between model and verifier.
  • Cold-Start Data Synthesis: Training begins with curated data comprising pairs of incorrect and corrected proofs. Incorrect model generations are checked immediately with the Lean 4 verifier, after which structured instructions prompt a correction; the synthesized dataset includes cases where the error feedback is concatenated to the context to prompt reflective rewriting (a schematic of this construction is sketched at the end of this section).
  • Optimization Objectives: Before verifier integration, the training objective is

$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{S},\, p_i \sim \mathcal{M}_\theta}\left[ \mathbb{I}(p_i \text{ passes verification}) \right]$$

Post-integration, it becomes

$$\max_\theta \; \mathbb{E}_{s,\, p_i,\, o_i}\left[ \mathbb{I}\left( \hat{p}_i \text{ using } o_i \text{ passes verification}\right) \right]$$

where $o_i$ is the verifier feedback and $\hat{p}_i$ the proof after correction.
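
The cold-start construction can be pictured as simple string assembly around the delimiters described above. The following is a minimal sketch under assumed names (`build_cold_start_example` and its arguments are illustrative, not the paper's code):

```python
# Minimal sketch (illustrative names, not the paper's code): assembling one
# cold-start training example from an incorrect proof, the Lean 4 verifier's
# error log, and the corrected proof, using the <code>/<interpreter> delimiters.
def build_cold_start_example(statement: str,
                             wrong_proof: str,
                             verifier_log: str,
                             fixed_proof: str) -> str:
    """Concatenate a failed attempt, its verifier feedback, and the correction."""
    return (
        f"{statement}\n"
        f"<code>\n{wrong_proof}\n</code>\n"
        f"<interpreter>\n{verifier_log}\n</interpreter>\n"
        "The verifier reports an error, so the proof is revised below.\n"
        f"<code>\n{fixed_proof}\n</code>\n"
    )
```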

2. Verifier Integration

A central advance in Leanabell-Prover-V2 is the deep coupling with the Lean 4 verification environment:

  • Multi-Turn Feedback Loop: For every Lean 4 code snippet generated, the code is executed immediately in the Lean 4 environment. The outcome—success message or full error log—is dynamically returned to the model, segmented by delimiters for further processing.
  • Self-Correction Mechanism: Upon identification of an error (such as a type mismatch or tactic failure), the error feedback is appended to the CoT and used by the model to revise its proof. This process can occur over multiple iterations, emulating a dialogue between the prover and external verifier; a minimal version of this loop is sketched after this list.
  • Self-Awareness through Feedback: The system develops “self-awareness” regarding the correctness of its reasoning, learning to revise in response to explicit verifier feedback at any step—a capability that is essential for formal methods.
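
A minimal sketch of this loop, assuming a hypothetical `model.generate` interface and a `lean_verify` function that returns a pass/fail flag plus the verifier log (neither is the paper's actual API):

```python
# Minimal sketch of the generate/verify/correct loop (hypothetical interfaces,
# not the paper's implementation).

def extract_code(cot: str) -> str:
    """Pull the last <code>...</code> block out of a generated chain of thought."""
    return cot.rsplit("<code>", 1)[-1].split("</code>", 1)[0].strip()

def prove_with_feedback(model, lean_verify, statement: str, max_turns: int = 4):
    context = statement
    for _ in range(max_turns):
        cot = model.generate(context)          # long CoT ending in a <code> block
        proof = extract_code(cot)              # candidate Lean 4 proof
        ok, log = lean_verify(proof)           # run the Lean 4 verifier
        if ok:
            return proof                       # verified proof found
        # Feed the error log back so the next turn can self-correct.
        context += cot + f"\n<interpreter>\n{log}\n</interpreter>\n"
    return None                                # iteration limit reached
```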

3. Reinforcement Learning Strategy

Leanabell-Prover-V2 employs an RL framework, using the DAPO algorithm, to optimize its outputs with respect to verification outcomes:

  • Token-Level Optimization: The RL loss is computed at the token level, with the reward

$$R_i = \begin{cases} R_\text{success}, & \text{if the proof passes verification} \\ R_\text{failed}, & \text{otherwise} \end{cases}$$

augmented by a format reward to encourage well-structured outputs.

  • Feedback Token Masking: During supervised fine-tuning and RL, all verifier feedback tokens are masked, preventing them from influencing model generation gradients. This localizes optimization strictly to model-generated content.
  • Multi-Turn RL Design: The model is trained over multiple CoT-turns, generating an initial proof and then successively refining it via rounds of verifier feedback, up to a set iteration limit. The DAPO objective is

$$J_{\text{DAPO}}(\theta) = \mathbb{E}_{s,\,\{p_i\}}\left[ \frac{1}{\sum_i |p_i|} \sum_{i,t} \min\!\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \operatorname{clip}\!\left( r_{i,t}(\theta),\, 1-\varepsilon_\text{low},\, 1+\varepsilon_\text{high} \right) \hat{A}_{i,t} \right) \right]$$

where $r_{i,t}(\theta)$ is the token-level probability ratio between the current and old policies, $\hat{A}_{i,t}$ the normalized advantage, and $|p_i|$ the token length of the $i$-th sampled proof.
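
As a rough illustration, the masked, token-level objective might be implemented along the following lines (a sketch with assumed tensor shapes and illustrative clipping values, not the paper's training code):

```python
import torch

# Sketch of a DAPO-style token-level loss with verifier-feedback masking
# (assumed shapes and illustrative epsilon values, not the paper's code).
# logp_new, logp_old: [B, T] log-probs of sampled tokens under current/old policy
# advantages:         [B, T] normalized advantages A_hat_{i,t}
# gen_mask:           [B, T] 1 for model-generated tokens, 0 for verifier feedback
def dapo_loss(logp_new, logp_old, advantages, gen_mask,
              eps_low: float = 0.2, eps_high: float = 0.28):
    ratio = torch.exp(logp_new - logp_old)                    # r_{i,t}(theta)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = torch.minimum(ratio * advantages, clipped * advantages)
    # Mask out verifier-feedback tokens so gradients flow only through
    # model-generated content, then average over the remaining tokens.
    gen_mask = gen_mask.to(per_token.dtype)
    masked = per_token * gen_mask
    return -(masked.sum() / gen_mask.sum().clamp(min=1.0))    # negate: minimize
```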

4. Performance and Evaluation

Evaluation on key benchmarks demonstrates Leanabell-Prover-V2’s improvements:

  • MiniF2F-test: On this competition-level benchmark, Leanabell-Prover-V2 built on Kimina-Prover-Preview-Distill-7B improved pass@32 by 5.3% and approached the baseline's pass@1024 performance with only 128 samples, indicating greater inference efficiency (the pass@k metric is sketched after this list). When built on DeepSeek-Prover-V2-7B, a consistent 1–2% gain was observed at standard sampling budgets.
  • ProofNet and ProverBench: Further experiments on undergraduate-level and other diverse mathematics benchmarks confirm that RL with verifier feedback yields higher solve rates and more stable, self-correcting model behavior. Training curves in the paper, tracking response length, reward scores, and “solve all” rates, show monotonic improvement.
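
For reference, the pass@k numbers above count a problem as solved when at least one of k sampled proofs passes the Lean 4 verifier; a minimal sketch of that bookkeeping:

```python
# Hedged sketch of pass@k as used in the benchmark numbers above: a problem is
# solved at budget k if any of its first k sampled proofs verifies.
def pass_at_k(verdicts_per_problem: list[list[bool]], k: int) -> float:
    """verdicts_per_problem[i] holds the verifier outcomes for problem i's samples."""
    solved = sum(any(verdicts[:k]) for verdicts in verdicts_per_problem)
    return solved / len(verdicts_per_problem)
```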

5. Practical Applications and Implications

Leanabell-Prover-V2’s architecture enables several practical advances:

  • Robust Proof Synthesis: By integrating real-time Lean 4 feedback into generation, the model avoids propagation of logical errors, producing reliably verifiable proofs in fewer attempts.
  • Interactive Theorem Proving: The multi-turn, tool-integrated paradigm opens new possibilities for AI-assisted or fully automated Lean proof development, potentially lowering entry barriers and expediting formal verification.
  • General AI Self-Improvement: The method illustrates a route for LLMs to become more “self-aware” through interaction with symbolic verifiers, a property of broad relevance for AI in other correctness-critical domains.

6. Source Code and Data Availability

All source code, datasets (including “incorrect-correct” proof pairs and formatted CoT examples), and pretrained models are available at https://github.com/Leanabell-LM/Leanabell-Prover-V2. Instructions for running experiments and reproducing results (Lean 4.9.0, model checkpoints, etc.) are provided, ensuring accessibility for both academic and applied research (Ji et al., 11 Jul 2025).

7. Context and Significance

Leanabell-Prover-V2 marks a convergence of neural whole-proof generation, real-time verification, and reinforcement learning with external tool feedback. Its approach improves sample efficiency, reliability, and solution rates relative to previous Lean-specialized LLMs. The explicit model–verifier interaction and reflexive correction constitute an advance towards “verifier-integrated” reasoning in LLMs, offering a robust template for future developments in automated formal mathematics and AI alignment (Ji et al., 11 Jul 2025).
