
MerLean-Prover: A Recursive Looping Harness for End-to-End Lean 4 Theorem Proving

Published 26 May 2026 in cs.LO and cs.CL | (2605.26959v1)

Abstract: MerLean-Prover is an end-to-end Lean4 theorem prover that replaces sorry declarations with kernel-checkable proofs. It is built from three agent types (Planning, Check, and Lean) composed by a recursive outer loop whose unit of revision is the proof plan itself, and uses no fine-tuning, no custom RL objective, and no theorem-specific scaffolding. On FormalQualBench, a benchmark of 23 PhD-qualifying-exam theorems, MerLean-Prover solves 10/23, surpassing the strongest published open-source baseline (OpenGauss, 8/23). On Putnam2025, the same harness closes 12/12 with substantially lower total wall-clock than the next-best system that closes the full set. The harness also transfers to smaller models: Sonnet closes all four tested FormalQualBench problems, and Haiku closes the two short ones. These results suggest that harness design is a central factor in end-to-end Lean4 theorem proving, alongside raw model capability, and that a relatively simple harness can already be effective.

Authors (3)

Summary

  • The paper demonstrates a multi-agent recursive harness for automating Lean 4 theorem proving by decomposing complex proofs into manageable nodes.
  • It isolates planning, code synthesis, and verification into distinct agents, ensuring axiom-freedom, signature fidelity, and effective error recovery.
  • Empirical results show competitive kernel-audited solves on FormalQualBench and PutnamBench, highlighting both efficiency and cost-effectiveness.

Introduction and Motivation

MerLean-Prover introduces a recursive agentic harness for automated theorem proving in Lean 4 that isolates proof planning, Lean code synthesis, and verification roles into separate agents. It targets the formalization and automated proof of advanced mathematical theorems, specifically focusing on the challenges at PhD-qualifying-exam level—the current cutting edge in formal theorem-proving by LLM agents. The core innovation is the decomposition of long-horizon proof construction into a minimal-objective-per-agent approach, addressing context and constraint degradation in LLMs during long iterative formalization. Rather than packing the full proof state and error history into prompts, MerLean-Prover externalizes the global proof plan and revises it recursively, enforcing faithfulness, correctness, and signature preservation at the harness level, not the model level.

System Architecture

MerLean-Prover operationalizes its approach with three agent types, each restricted to a well-defined objective:

  • Planning Agent: Responsible for constructing and revising a topologically ordered proof plan based on the Lean file input. It is invoked both for the initial planning phase and for dynamic replanning in response to failures detected downstream.
  • Lean Agent: Given a single proof statement from the plan, it synthesizes Lean code attempting to discharge that statement, potentially using intermediate placeholders (sorrys). No single Lean Agent invocation handles more than one proof node at a time.
  • Check Agent: A stateless, read-only evaluator. Each invocation answers a single question: whether a step is mathematically sound, whether a node needs further decomposition, or whether the result stays faithful to the original theorem signature.
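The shared proof plan described above can be pictured as a small dependency graph. The following is a minimal sketch, not the paper's implementation: the `PlanNode` fields, the node names, and the `dependency_order` helper are all hypothetical, and the standard-library `graphlib` module is used only to illustrate what "topologically ordered" means here.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class PlanNode:
    """One proof obligation in the plan (all field names are hypothetical)."""
    name: str                                  # identifier of the lemma or subgoal
    statement: str                             # Lean statement to be proved
    deps: list = field(default_factory=list)   # names of nodes this one relies on
    status: str = "open"                       # "open" until a proof is accepted


def dependency_order(plan):
    """Return node names so that every node comes after its dependencies."""
    graph = {name: set(node.deps) for name, node in plan.items()}
    return list(TopologicalSorter(graph).static_order())


# A toy plan: `main` needs both lemmas, and `lemma_b` needs `lemma_a`.
plan = {
    "main": PlanNode("main", "theorem main : P", deps=["lemma_a", "lemma_b"]),
    "lemma_a": PlanNode("lemma_a", "lemma lemma_a : A"),
    "lemma_b": PlanNode("lemma_b", "lemma lemma_b : B", deps=["lemma_a"]),
}

order = dependency_order(plan)
print(order)
```

Under this picture, a Lean Agent call would receive exactly one `PlanNode`, and a Planning Agent revision would add, split, or rewrite nodes before the order is recomputed.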

At all times, the only globally shared state is the mutable proof plan, which is refined in response to feedback from the Check Agent(s) and failures of the Lean Agent to close proof nodes. Harness-level constraints guarantee axiom-freedom, signature fidelity, and exclude ad-hoc theorem-specific interventions, fine-tuning, or reward shaping.
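To make the input format and the axiom constraint concrete, here is a small Lean 4 illustration. The theorem names are hypothetical and the statement is deliberately trivial; it is not taken from either benchmark. The first declaration is the kind of `sorry` placeholder the harness is asked to replace, and the second shows a closed proof followed by Lean's `#print axioms` command, which lists the axioms a declaration depends on.

```lean
-- Input shape: the statement is fixed and the proof is left as `sorry`.
theorem add_zero_open (n : Nat) : n + 0 = n := by
  sorry

-- A closed version with the same signature (only the name differs so that
-- both declarations can live in one file for illustration).
theorem add_zero_closed (n : Nat) : n + 0 = n := by
  rfl

-- Lists the axioms the proof depends on. A proof that still contains
-- `sorry` would report `sorryAx` here, which an axiom audit can reject.
#print axioms add_zero_closed
```

Signature fidelity in this setting means the statement after the colon must stay exactly as given; only the proof body may change.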

Proof search proceeds recursively: the outer loop iterates over open proof plan nodes in dependency order, applies the Lean Agent, then invokes one or more Check Agents to validate outcomes. Any failure (compile errors, incorrect mathematics, or deviation from the theorem signature) may trigger a plan revision, which invalidates downstream nodes as needed. Thus, the harness orchestrates an iterative proof cone decomposition, splitting complex proof steps until closure is achieved or an external budget is exhausted.
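The control flow in the paragraph above can be sketched as a short loop. This is a simplified sketch under stated assumptions, not the authors' code: the three agents are replaced by toy stand-in functions, the plan is reduced to a mapping from node name to dependency names, the budget is a plain attempt counter, and invalidation is reduced to dropping closed nodes that a revision removes.

```python
def run_harness(plan, lean_agent, check_agent, planning_agent, budget):
    """Close open plan nodes in dependency order, replanning on any failure."""
    plan = {name: list(deps) for name, deps in plan.items()}  # private copy
    closed = set()
    attempts = 0
    while attempts < budget:
        # Open nodes whose dependencies are all closed are ready to attempt.
        ready = [n for n, deps in plan.items()
                 if n not in closed and all(d in closed for d in deps)]
        if not ready:
            break  # either everything is closed or the plan is stuck
        node = ready[0]
        attempts += 1
        if lean_agent(node, plan) and check_agent(node, plan):
            closed.add(node)
        else:
            # Any failure hands the plan back to the planner for revision.
            plan = planning_agent(plan, node)
            closed &= set(plan)  # nodes removed by the revision lose their status
    solved = all(n in closed for n in plan)
    return solved, closed, attempts


# Toy stand-ins (hypothetical): `main` only "compiles" once it has been
# split so that it depends on a helper node named `main_step`.
def toy_lean_agent(node, plan):
    return node != "main" or "main_step" in plan[node]


def toy_check_agent(node, plan):
    return True  # the real checks (soundness, fidelity) are not modelled


def toy_planning_agent(plan, failed):
    revised = {n: list(d) for n, d in plan.items()}
    helper = f"{failed}_step"
    revised[helper] = list(revised[failed])         # helper inherits the deps
    revised[failed] = revised[failed] + [helper]    # failed node now needs it
    return revised


solved, closed, attempts = run_harness(
    {"lemma": [], "main": ["lemma"]},
    toy_lean_agent, toy_check_agent, toy_planning_agent, budget=10,
)
print(solved, sorted(closed), attempts)
```

In the toy run, `main` fails once, the planner splits off `main_step`, and the loop then closes the helper before retrying `main`.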

Empirical Evaluation

FormalQualBench Results

On the FormalQualBench benchmark (a suite of 23 PhD-qualifying-exam Lean theorems), MerLean-Prover achieves 10/23 kernel-audited solves, exceeding the best previously published open-source system (OpenGauss, 8/23). All proofs depend on no axioms beyond a small permitted set, $\{\texttt{propext},\ \texttt{Quot.sound},\ \texttt{Classical.choice}\}$. Nine of the ten solves finish within a four-hour wall-clock budget, and the remaining theorem closes within 4h40m. Averaged over solved problems, each solve costs \$118.35 in LLM API usage and takes 1h59m of wall-clock time; the harness's repeated prompt read/write cycles raise the dollar cost, while caching reduces wall-clock time.
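The permitted-axiom rule lends itself to a simple set check. The sketch below assumes the axiom names for a finished proof have already been extracted (for example from Lean's `#print axioms` output); that extraction step is not shown, and the function names `audit_axioms` and `passes_audit` are hypothetical.

```python
# The three axioms named in the text as permitted.
PERMITTED_AXIOMS = frozenset({"propext", "Quot.sound", "Classical.choice"})


def audit_axioms(reported):
    """Return the reported axioms that fall outside the permitted set, sorted."""
    return sorted(set(reported) - PERMITTED_AXIOMS)


def passes_audit(reported):
    """A proof passes when it uses no axiom outside the permitted set."""
    return not audit_axioms(reported)


print(passes_audit(["propext", "Classical.choice"]))
print(audit_axioms(["propext", "sorryAx"]))
```

A proof that still contains `sorry` depends on `sorryAx`, so it fails this check even if it compiles.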

PutnamBench Performance

On the full set of 12 Putnam 2025 problems, MerLean-Prover closes 12/12, with fastest solve times on 8/12 problems. Its aggregate wall-clock (789 minutes) is significantly lower than other leading systems (e.g., Numina-Lean-Agent: 3889 minutes, Axiom: 2577 minutes) that also reach 12/12 closure. This shows that the recursive harness offers both effectiveness and efficiency at the undergraduate olympiad/Putnam difficulty level.
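For a sense of scale, the aggregate wall-clock figures quoted above can be turned into ratios. The snippet uses only the three totals stated in the text; the variable names are illustrative.

```python
# Aggregate wall-clock minutes over the 12 Putnam 2025 problems, as quoted above.
wall_clock_minutes = {
    "MerLean-Prover": 789,
    "Numina-Lean-Agent": 3889,
    "Axiom": 2577,
}

baseline = wall_clock_minutes["MerLean-Prover"]
hours, minutes = divmod(baseline, 60)

# How many times longer each other system ran in total.
ratios = {
    name: total / baseline
    for name, total in wall_clock_minutes.items()
    if name != "MerLean-Prover"
}

print(f"MerLean-Prover total: {hours}h{minutes:02d}m")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the MerLean-Prover total")
```

By this arithmetic the other two systems took roughly 4.9 and 3.3 times as long in aggregate to close the same set.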

Stability and Generality

Across multiple runs on selected problems, repeated success is observed: all runs for each of four challenging FormalQualBench problems produced successful, kernel-checked proofs. The number of plan nodes and per-node closure time showed moderate variance, providing evidence against lucky single trajectories.

Critically, the harness generalizes to smaller inference models: with Sonnet, all four selected problems are closed; Haiku closes two; in all cases, kernel-checked proofs are produced. However, as model size decreases, plan decomposition may become finer and total cost can increase due to extended proof plans, reflecting a shifted cost-performance balance.

Key Claims and Implications

MerLean-Prover advances two bold claims:

  • Harness design is a central factor: Control flow and global-state management, implemented outside LLM prompt engineering, matter alongside raw model capability in complex Lean 4 theorem proving at the current state of the art.
  • Simplicity suffices with the right architecture: No fine-tuning, reward shaping, or theorem-specific engineering is necessary for competitive formal proof synthesis. Generalist LLMs, when integrated through well-designed agentic harnesses, are capable of handling advanced formalization tasks.

This architecture robustly addresses the context/framing and constraint-following limitations of LLMs identified in prior work (Xin et al., 2024, Xin et al., 5 Feb 2025, Jaroslawicz et al., 15 Jul 2025): by limiting each agent prompt to a single local goal and delegating responsibility for global logical/semantic consistency to the harness, it enables reliable progress even in the presence of LLM memory and reasoning decay over long contexts.

Practical and Theoretical Implications

From a systems perspective, the MerLean-Prover design offers practical benefits:

  • Modularity and extensibility: The same harness applies unchanged to new benchmarks, new models, and alternative input formats, so long as the bridge to Lean-with-sorry files is available.
  • Model-agnostic deployment: The approach can leverage future open-source, Lean-specific LLMs ([mistral2026leanstral]), or more efficient/robust inference engines, holding down costs and eliminating dependency on proprietary API services.
  • Faithfulness auditing: The recursive check-and-replan mechanism actively prevents drift from the original theorem signature, catching subtle unsoundness or signature-weakened “solutions” that might otherwise slip past automated approval.

Theoretically, this work suggests:

  • Proof search as recursive multi-agent control: Large mathematical proofs can be constructed by recursively delegating original goals to planning (for subgoal decomposition), then closing subgoals in an iterated bottom-up process, rather than monolithic forward/backward search.
  • Resource-model tradeoff: For smaller/weaker models, the cost to closure is not determined by per-call speed or price alone; it grows with the size and granularity of the final proof plan. There is a non-trivial inflection point where plan length and token volume outweigh per-call savings.
  • Boundaries of library reliance: While the harness is model- and theorem-agnostic, it still depends on the richness and accessibility of Mathlib within Lean. When proof steps require library support that is absent or hard for the model to surface, progress stalls—a limitation endemic in current autoformalization approaches.
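The resource-model tradeoff can be illustrated with a toy cost model. Every number below is invented for illustration and none comes from the paper; the model simply multiplies plan size by attempts per node by per-call price, which is an assumed simplification of how cost accumulates.

```python
def total_cost_cents(plan_nodes, attempts_per_node, cents_per_call):
    """Toy model: every attempt on every plan node is one paid model call."""
    return plan_nodes * attempts_per_node * cents_per_call


# Hypothetical profiles: the smaller model is 5x cheaper per call but is
# assumed to need a finer plan and more attempts per node.
larger_model = total_cost_cents(plan_nodes=20, attempts_per_node=3, cents_per_call=100)
smaller_model = total_cost_cents(plan_nodes=80, attempts_per_node=5, cents_per_call=20)

# Plan size at which the smaller model's total matches the larger model's,
# holding its attempts per node and per-call price fixed.
break_even_nodes = larger_model // (5 * 20)

print(larger_model, smaller_model, break_even_nodes)
```

With these made-up inputs the cheaper model ends up more expensive overall, and it only wins if its plan stays under the break-even node count.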

Future Directions

Several extensions are natural:

  1. Engineering optimizations: Reducing redundant compile attempts and plan diffing, smarter cache handling, and more sophisticated plan invalidation could yield significant cost reductions.
  2. Lean-specialized model integration: Replacing the general-purpose LLM with a base model trained on Lean mathematics may further decouple cost from model size without sacrificing proof quality.
  3. Library-guided decomposition: Using failure points in decomposition as signals to direct Mathlib expansion/documentation may automate the identification of “blockers” for future formal mathematics.

Conclusion

MerLean-Prover demonstrates that high-performance, end-to-end automated formal theorem proving in Lean 4 can be achieved via recursive, agentic harness design with strict one-objective-per-invocation control. The design outperforms the prior open-source baseline on FormalQualBench (10/23 versus 8/23) and closes all 12 Putnam 2025 problems, matching the leading systems, without fine-tuning, a custom RL objective, or theorem-specific scaffolding. The empirical evidence reinforces the centrality of system-level design in the next phase of AI-augmented mathematics, and points toward general-agent architectures as a main line of advance for both practical trustworthiness and theoretical scalability in automated proof.

Reference: (2605.26959)
