
Seed-Prover 1.5: Automated Lean Theorem Prover

Updated 22 December 2025
  • The paper presents Seed-Prover 1.5, which integrates agentic reinforcement learning and hierarchical test-time scaling to solve complex formal proofs efficiently.
  • It employs a multi-stage pipeline combining a natural-language prover, a sketch model, and an agentic Lean prover to incrementally build and verify proofs.
  • Experimental benchmarks show superior performance on Putnam and IMO contests, reducing both proof time and computational costs compared to earlier models.

Seed-Prover 1.5 is a formal theorem-proving model developed for undergraduate- and graduate-level mathematics, grounded in large-scale agentic reinforcement learning (RL) and efficient test-time scaling (TTS) routines. Its architecture, training regime, and performance benchmarks represent a significant advance in the automation of formal reasoning in Lean, particularly at the level of challenging contests such as Putnam and the International Mathematical Olympiad (IMO) (Chen et al., 19 Dec 2025).

1. Formal Theorem Proving in Lean and Problem Setting

Seed-Prover 1.5 operates within Lean 4, an interactive theorem prover with a dependent-type kernel and a comprehensive mathematical library (Mathlib). The formal environment supports tactic-driven proof development, with each action validated by the Lean compiler. This strict verification ensures outputs are fully correct, sidestepping the hallucination problems observed in purely natural-language LLMs. The proving state comprises all declared definitions and proven lemmas as context, a current proof goal, and a set of permissible actions: tactic executions, lemma admissions, or invocations of external tools such as Mathlib search or Python evaluation.
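The lemma-admission pattern central to this setup can be illustrated with a toy Lean 4 example (the lemma itself is illustrative, not taken from the paper): a hard step is admitted with sorry, to be discharged later by the agentic prover.

```lean
-- A top-level theorem whose hard step is admitted with `sorry`,
-- mirroring how proof sketches stub sub-lemmas for later completion.
theorem add_zero_comm (n : ℕ) : n + 0 = 0 + n := by
  have key : 0 + n = n := by sorry  -- sub-lemma left to the agentic prover
  rw [key, Nat.add_zero]
```

Lean compiles such a sketch but flags the sorry, so the overall proof is accepted only once every admitted sub-lemma is replaced by a verified proof.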

Scaling LLMs to undergraduate and above-level formal problems is impeded by the need for hierarchical decomposition, library navigation, and long-horizon planning, compounded by the high "interaction tax" associated with dense Lean interactions and compilation failures. Prior approaches using natural-language LLMs achieve near-perfect Putnam performance, but attempts at formal, stepwise Lean proofs have typically entailed prohibitive computational costs (e.g., 500 TPU-days per proof) or high rates of rejected outputs (Chen et al., 19 Dec 2025).

2. Model Architecture and Agentic RL Paradigm

Seed-Prover 1.5’s architecture is structured as a pipeline:

  1. Natural-Language Prover (Doubao-Seed 1.6): Produces a detailed, informal proof drafted in natural language.
  2. Sketch Model: Transforms the informal proof into a lemma-driven Lean sketch, where challenging sub-lemmas are stubbed with by sorry.
  3. Agentic Lean Prover: An LLM equipped with tool-calling special tokens that incrementally calls verification tools (Lean verification, Mathlib search, Python execution) while simultaneously generating Lean code and natural-language explanations.

Internally, the Agentic Lean Prover is an autoregressive Transformer parameterizing the policy $\pi_\theta$, which, at each step, processes the proof state $s_t$ (goal, context, tool responses) and produces token sequences (including tool calls or lemma proofs). The action space encompasses Mathlib queries, Lean verifications, Python executions, and textual reasoning.
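The agentic interaction loop described above can be pictured with a minimal sketch; the environment interface, return conventions, and function names here are hypothetical stand-ins, not the paper's actual API.

```python
def run_agent(policy, env, max_tool_calls=28):
    """One trajectory: the policy emits actions (tactics, tool calls, or text)
    until the proof is verified, fails, or the tool-call budget is spent.
    Returns the binary outcome reward: +1 on verification, -1 otherwise."""
    state = env.reset()
    for _ in range(max_tool_calls):
        action = policy(state)                   # next token sequence from pi_theta
        state, done, verified = env.step(action) # Lean/Mathlib/Python tool response
        if done:
            return 1 if verified else -1
    return -1                                    # budget exhausted counts as failure
```

The key design point this sketch reflects is that the reward is purely outcome-based: intermediate tool responses shape the state but carry no reward of their own.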

Reinforcement learning is driven by a binary outcome-based reward $R(\tau)$: $+1$ if the target theorem is verified by Lean, $-1$ otherwise. The RL objective,

J(θ)=Eτπθ[R(τ)]J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]

is optimized by a PPO-style (proximal policy optimization) surrogate loss

$$\mathcal{L}_{\mathrm{PPO}}(\theta) = -\frac{1}{\sum_i |\tau_i|}\sum_{i,t} \min\Bigl(r_{i,t}(\theta)\,\widehat A_{i,t},\ \mathrm{clip}\bigl(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\bigr)\,\widehat A_{i,t}\Bigr)$$

where $r_{i,t}(\theta)=\pi_{\theta}(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and $\widehat A_{i,t}$ is an advantage estimator.

Key training hyperparameters include batch size $G \approx 32$, clip range $\varepsilon \approx 0.1$, learning rate $3 \times 10^{-6}$, and $\sim 1{,}200$ RL steps. The training dataset comprises approximately 100,000 formal statements, filtered so that the supervised fine-tuned (SFT) model solves each statement at most three times under "light inference." Rollouts reach up to $64{,}000$ tokens and allow up to $28$ tool calls per trajectory (Chen et al., 19 Dec 2025).
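For concreteness, the clipped surrogate above can be written out over per-token log-probabilities; this is a didactic re-implementation of the standard PPO loss using pure Python, not the paper's training code.

```python
import math

def ppo_loss(logp_new, logp_old, advantages, eps=0.1):
    """Token-level clipped PPO surrogate, negated for minimization:
    L = -(1/N) * sum_t min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t),
    where r_t = exp(logp_new_t - logp_old_t) is the importance ratio."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        r = math.exp(lp_new - lp_old)                 # importance ratio r_t
        clipped = max(min(r, 1.0 + eps), 1.0 - eps)   # clip(r, 1-eps, 1+eps)
        total += min(r * adv, clipped * adv)          # pessimistic of the two
    return -total / len(advantages)
```

The min over the unclipped and clipped terms is what caps the policy update: large ratio changes stop earning extra reward once they leave the $[1-\varepsilon,\,1+\varepsilon]$ band.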

3. Test-Time Scaling (TTS) Workflow

The TTS workflow employs a hierarchical multi-agent pipeline:

  • Step 1: The Natural-Language Prover generates an informal, lemma-based proof for an input formal Lean statement.
  • Step 2: The Sketch Model converts this into a Lean sketch, admitting $N$ sub-lemmas as by sorry.
  • Step 3: The Agentic Lean Prover attempts each sub-lemma under a Pass@3×3 budget.
  • Step 4a: If a lemma is proven, its proof is cached.
  • Step 4b: If disproven or timed out, the Sketch Model refines the NL proof and generates alternative sub-lemmas.

This loop continues until all leaf lemmas are proven or a maximum search depth is reached. Each agentic prover call is executed in parallel, with an overall compute budget proportional to $w^d$ (width $w$, depth $d$), and binary success/failure signals used to manage search tree expansions and pruning.

Pseudocode:

function TTS_Prove(stmt, depth=0):
    if depth > MAX_DEPTH: return Fail
    nl_proof ← NL_Prover(stmt)
    sketch ← Sketch_Model(stmt, nl_proof)
    for lemma in sketch.lemmas:
        result ← Agentic_Prover(lemma)
        if result = Fail:
            return TTS_Prove(stmt, depth+1)  # refine sketch
    return Success
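Under the stated assumptions, the loop can be made executable by injecting the model calls as plain functions; the callables, their signatures, and the return conventions here are hypothetical stand-ins for the actual models.

```python
def tts_prove(stmt, sketch_model, agentic_prover, max_depth=3, depth=0, cache=None):
    """Hierarchical TTS loop: sketch the statement into sub-lemmas, attempt
    each with the agentic prover, cache successes (Step 4a), and refine the
    sketch on any failure (Step 4b) up to a maximum depth."""
    if cache is None:
        cache = {}
    if depth > max_depth:
        return False                       # search-depth budget exhausted
    lemmas = sketch_model(stmt, depth)     # hypothetical: list of sub-lemma ids
    for lemma in lemmas:
        if lemma in cache:                 # Step 4a: reuse a cached proof
            continue
        if agentic_prover(lemma):          # hypothetical: True iff Lean verifies
            cache[lemma] = True
        else:                              # Step 4b: refine sketch and recurse
            return tts_prove(stmt, sketch_model, agentic_prover,
                             max_depth, depth + 1, cache)
    return True
```

Note the cache is threaded through the recursion, so lemmas proven under an earlier sketch are not re-attempted after a refinement.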

4. Experimental Benchmarks and Ablation Studies

Seed-Prover 1.5 sets new state-of-the-art results on several mathematical formal proving benchmarks:

| Approach | Budget | PutnamBench | Fate-H | Fate-X |
|---|---|---|---|---|
| AlphaProof | 500 TPU-days/problem | 56.1% | — | — |
| Hilbert | avg Pass@1 840 | 70.0% | — | — |
| Seed-Prover 1.0 (medium) | 18 H20-days/problem | 50.4% | 35% | 9% |
| Seed-Prover 1.5 (full TTS) | 10 H20-days/problem | 87.9% | 80% | 33% |

On Putnam 2025, Seed-Prover 1.5 solved 11 of 12 problems within 9 hours, using ≤ 40 H20-days per problem. On IMO 2025, it proved 5 of 6 problems (e.g., P₁ in 16.5 hours, P₃ in 5 hours, P₅ in 1 hour) (Chen et al., 19 Dec 2025, Chen et al., 31 Jul 2025).

Ablation studies indicate that agentic RL confers an increase of approximately +10 percentage points in held-out Putnam-200 accuracy compared to SFT, while simultaneously reducing tool calls and proof context lengths. Most problems (80%) are solved within the first 5 hours of TTS; the remainder constitute a long computational tail up to 53 hours.

5. Comparative Developments and Version-Specific Enhancements

Seed-Prover 1.5 is a refinement upon the previous Seed-Prover v1.0 and related agentic LLM-based provers (Chen et al., 31 Jul 2025). It introduces several substantial enhancements:

  • Extended Prompting Mix: Aggregated "failed attempts" and proof sketches from resource-intensive runs, increasing sample efficiency by 12%.
  • Dynamic Lemma Scoring: Lemma-pool heuristics now factor in proof-rate, semantic relevance, and proof length, which increased final-proof success by approximately 6 percentage points on IMO hardest problems.
  • Adaptive Budget Allocation: A lightweight bandit-style allocator grants greater refinement resources to lemmas with higher proof variance, reducing average solve time by 25%.
  • LooKeng Interface: Stateless Python–Lean REPL supporting memory tracking and high concurrency, reducing I/O latency by around 40%.
  • Parameter Expansion: Inclusion of a 128-head mixture-of-experts layer in the Transformer increases chain-of-thought coherence and per-step log-probabilities (+0.8 nats), directly improving complex reasoning capacity.
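The adaptive allocator's core idea — spend refinement budget where proof-outcome variance (i.e., uncertainty) is highest — can be sketched as follows. The paper does not publish the allocator, so the function name, data shapes, and proportional-to-variance rule are all assumptions for illustration.

```python
def allocate_budget(lemma_stats, total_budget):
    """Split a refinement budget across lemmas in proportion to the variance
    of their past 0/1 proof outcomes: lemmas that sometimes succeed and
    sometimes fail get the most attempts. Hypothetical sketch."""
    def variance(outcomes):
        mean = sum(outcomes) / len(outcomes)
        return sum((x - mean) ** 2 for x in outcomes) / len(outcomes)

    # Small epsilon keeps zero-variance lemmas from dividing by zero overall.
    weights = {k: variance(v) + 1e-6 for k, v in lemma_stats.items()}
    z = sum(weights.values())
    return {k: round(total_budget * w / z) for k, w in weights.items()}
```

A lemma proven on every past attempt (variance 0) receives essentially no further budget, while a lemma with mixed outcomes absorbs most of it — the bandit-style "explore where uncertain" behavior the bullet describes.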

Collectively, these changes raised PutnamBench medium-budget solves by +5 pp and reduced heavy inference compute by approximately 30% (Chen et al., 31 Jul 2025).

6. Qualitative Insights, Limitations, and Future Directions

Qualitative analysis demonstrates that Seed-Prover 1.5 achieves hierarchical decomposition on complex problems by generating intermediate lemmas (e.g., for functional equations and challenging algebra), with Mathlib search and external tool use guiding the formalization. Notable failure cases occur on PhD-level (Fate-X) problems, primarily due to missing specialized library formalisms in Mathlib, and in situations requiring proof contexts exceeding 32K tokens, which stress the current RL architecture.

Prospective research directions identified include:

  • Automatic formalization of research literature to enrich Mathlib, resolving coverage deficiencies detected during Fate-X evaluation.
  • Iterative self-play to acquire hard examples and further scale RL-based proficiency.
  • Rubric-based RL to move beyond binary reward, optimizing for sketch quality and other fine-grained proof attributes.
  • Advanced tree search to supplement agentic reasoning, e.g., best-first exploration.

7. Significance and Broader Impact

Seed-Prover 1.5 exemplifies the synthesis of large-scale agentic RL with rigorous formal verification and efficient, scalable multi-tiered search, attaining performance on challenging formal mathematical problems previously inaccessible to automated methods. Its modular design—combining natural-language, sketch-driven, and agentic Lean models—offers an extensible blueprint for future work on automated reasoning, formalization, and mathematical discovery (Chen et al., 19 Dec 2025, Chen et al., 31 Jul 2025).
