
Seed-Prover: LLM-Based Theorem Prover

Updated 2 August 2025
  • Seed-Prover is an automated theorem proving system that uses LLMs to generate and verify lemma-style proofs in Lean 4.
  • It utilizes a multi-stage reinforcement learning pipeline with chain-of-thought summaries to explore deep and broad reasoning strategies.
  • The system integrates a specialized geometry reasoning engine to bridge formalization gaps in complex geometric problems for IMO-level contests.

Seed-Prover is an automated theorem proving system based on LLMs and formalized lemma-style proof construction in Lean 4. It advances the state of automated theorem proving by combining long chain-of-thought generation, multi-stage formal verification, and a modular lemma pool architecture to achieve both depth and breadth in reasoning on mathematical contest problems. Seed-Prover leverages formal supervision and iterative feedback from Lean’s verifier, and is further enhanced with an optimized geometry reasoning engine to overcome domain gaps in geometry formalization.

1. Model Architecture and Proof Construction

Seed-Prover operates in a whole-proof, lemma-style regime—contrasting with step-level generation approaches. For a given theorem, it constructs a sequence of auxiliary lemmas, each encapsulated as a Lean lemma block. These intermediate lemmas are explicitly formalized, named, and stored in a lemma pool along with their proofs, meta-information on proof difficulty, dependency links, and success status.
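To make the lemma-style regime concrete, here is a minimal Lean 4 sketch of the kind of decomposition described above. The lemma names and statements are invented for illustration (they are not from Seed-Prover's actual outputs), and the snippet assumes Mathlib is available:

```lean
-- Hypothetical lemma-style decomposition. The auxiliary lemma is
-- formalized and named separately, so it can be stored in the lemma
-- pool and reused by other proof attempts.
import Mathlib

lemma sq_term_nonneg (a : ℤ) : 0 ≤ a ^ 2 := sq_nonneg a

theorem sum_of_squares_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 :=
  add_nonneg (sq_term_nonneg a) (sq_term_nonneg b)
```

Each `lemma` block is independently checkable by Lean, which is what lets the system record per-lemma success status and dependency links.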

The model’s output thus has a modular structure: it generates initial conjectures and lemmas, attempts to prove each with Lean, and records both proved results and failed attempts. By reusing proved lemmas and iteratively refining subproofs, Seed-Prover adapts its global reasoning trajectory, employing flexible combinations of successful components and exploring alternative pathways for unresolved subgoals. This design enables both fine-grained (deep) exploration and combinatorial (broad) search.

The platform integrates a self-summarization capability: at each iteration, it generates natural language and/or formal summaries of progress, enabling context-aware reuse of partial results and improved management of complex proof states.

2. Training Paradigm and Reinforcement Learning

Seed-Prover is trained via a multi-stage reinforcement learning pipeline based on the VAPO framework. The reward function is strictly binary: a reward of 1 for proofs accepted by Lean’s formal verifier, 0 otherwise. The RL process uses Lean’s feedback to identify syntax errors, misapplied lemmas, or incomplete proofs, and propagates this feedback back into the training pool.
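The binary reward is simple enough to sketch directly. In this hedged example, `lean_check` stands in for a call to the Lean 4 verifier (its `(ok, messages)` return shape is an assumption); only full acceptance earns reward:

```python
def binary_reward(proof: str, lean_check) -> int:
    """Strictly binary RL reward: 1 iff Lean accepts the proof.

    `lean_check` is a placeholder for invoking the Lean verifier;
    its error messages feed back into later training prompts.
    """
    ok, _messages = lean_check(proof)
    return 1 if ok else 0
```

Partial progress earns no reward directly; instead, the verifier's messages are recycled as context for subsequent refinement attempts.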

Failed proofs and abandoned attempts are not discarded; instead, the model is explicitly trained to exploit the information contained in partial summaries, failed conjectures, and the status of each lemma (proven or otherwise) in later iterations. This continual process lets the system learn from both positive and negative signals and adapt its strategies dynamically.

Training prompts are augmented with Lean’s compiler messages, brief natural language hints, and structured summaries of chained proof attempts (i.e., chain-of-thought traces). This allows for resilient RL-based learning even in domains with high structural error rates.

3. Iterative Inference and Problem-Solving Strategies

Seed-Prover utilizes three structured test-time inference regimes that balance sample efficiency and proof coverage, scaling to IMO-level contest mathematics:

  • Light Inference: Each proof attempt is permitted up to 8–16 consecutive refinements. After each Lean feedback, the model edits only the affected region of the proof, correcting the current attempt or switching tactics. The effective search cost is comparable to Pass@128, yielding robust error correction without inflating the sample budget.
  • Medium Inference: Introduces an inner refinement loop for hard-to-prove lemmas, nested within the overall proof refinement loop. The main proof structure is refined, while sub-lemmas are recursively attempted under tighter budgets, and successful subproofs are spliced into the main proof.
  • Heavy Inference: Generates thousands of candidate conjectures (potential auxiliary properties or sublemmas) to populate an expanded conjecture pool. Each conjecture is then attempted in parallel. Proven lemmas are added to the main lemma pool, ranked by metrics such as proof rate and semantic relevance. Top-ranked results are subsequently integrated in the medium-inference regime to synthesize the final proof.
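The light-inference regime above can be sketched as a simple refine-on-feedback loop. The `generate` and `verify` callables below are hypothetical stand-ins for the LLM and the Lean verifier; the 8–16 refinement budget is the figure stated above:

```python
def light_inference(generate, verify, max_refinements=16):
    """Sketch of light inference: one proof attempt, iteratively
    refined against verifier feedback (up to 8-16 rounds).

    generate(feedback) -> proof attempt (edits only the failing region
                          when feedback is present)
    verify(attempt)    -> (ok, feedback) from the Lean verifier
    """
    attempt = generate(feedback=None)
    for _ in range(max_refinements):
        ok, feedback = verify(attempt)
        if ok:
            return attempt
        attempt = generate(feedback=feedback)
    return None  # budget exhausted without a verified proof
```

Medium inference wraps a loop like this around each hard sub-lemma under a tighter budget, and heavy inference runs many such loops in parallel over a large conjecture pool.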

This multiphase approach enables both breadth—systematically exploring wide conjecture spaces—and depth—refining and improving upon complex subproofs through iterative feedback. Empirically, the use of a large conjecture pool and staged refinement identifies rare but essential intermediate results that may be missed by purely sequential or flat sampling.

4. Benchmark Performance and Quantitative Evaluation

Seed-Prover demonstrates state-of-the-art performance on a suite of formal mathematical benchmarks:

| Benchmark | Metric | Seed-Prover Score | Reference SOTA |
|---|---|---|---|
| Formalized IMO problems | Proof success rate | 78.1% | lower (<50%) |
| MiniF2F (test) | Proof success rate | 99.6% | lower |
| MiniF2F (val, medium setting) | Proof success rate | 100% | lower |
| PutnamBench | Proofs (out of 657) | 331 | Goedel-Prover-V2 (lower) |

On IMO-level problems, Seed-Prover's success rate exceeds that of previous models by more than 3×. On MiniF2F, the model saturates both the test and validation splits. On PutnamBench, a challenging undergraduate-level formal benchmark, it proves 331 statements under the medium inference setting, establishing a new empirical baseline.

The ranking of conjectures and lemmas is based on composite metrics that account for proof rate (the ratio of successful proofs to attempts for each lemma), semantic relevance to the problem statement, and raw proof length. This enables the system to prioritize conjectures that meaningfully advance the main proof.
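A hedged sketch of such a composite score: the paper names proof rate, semantic relevance, and proof length as factors, but the weights and the exact combination below are assumptions for illustration only:

```python
def rank_score(successes, attempts, relevance, proof_len,
               w_rate=0.5, w_rel=0.4, w_len=0.1):
    """Hypothetical composite ranking score for a conjecture.

    - proof rate: successful proofs / attempts for this lemma
    - relevance: semantic relevance to the problem statement, in [0, 1]
    - proof length: shorter proofs score higher via 1 / (1 + len)
    """
    proof_rate = successes / attempts if attempts else 0.0
    return w_rate * proof_rate + w_rel * relevance + w_len / (1 + proof_len)
```

Under any such scheme, a frequently provable, highly relevant, short lemma outranks a rarely provable, tangential, long one, which is the ordering behavior the ranking needs.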

5. Geometry Reasoning Engine: Seed-Geometry

A significant extension of Seed-Prover is the Seed-Geometry subsystem, which addresses Lean’s traditional limitations in geometry proof automation. Seed-Geometry implements a domain-specific language (DSL) for geometric constructions, particularly those arising from ruler-and-compass problems. It groups common composite steps (e.g., isogonal conjugates, exsimilitude centers) into concise statements to reduce proof verbosity and complexity.

The backend has been re-engineered in C++ (with Python bindings), improving search throughput by approximately two orders of magnitude over previous Python-based implementations. Broad beam search, distributed over multiple machines, is used to generate millions of auxiliary geometry problems. Seed-Geometry supports fast forward-chaining over complex geometric constructions, enabling effective bridging of gaps in Lean geometry formalization.
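The forward-chaining at the core of such an engine can be illustrated compactly. This is a generic saturation loop in Python, not Seed-Geometry's C++ implementation; rules are modeled as (premise set, conclusion) pairs, which is an assumption about representation:

```python
def forward_chain(facts, rules, max_rounds=100):
    """Minimal forward-chaining sketch: repeatedly fire every rule
    whose premises are all known, until no new facts are derived.

    facts: iterable of known facts (e.g. geometric predicates)
    rules: list of (frozenset_of_premises, conclusion) pairs
    """
    facts = set(facts)
    for _ in range(max_rounds):
        new = {concl for prems, concl in rules
               if prems <= facts and concl not in facts}
        if not new:
            break  # fixed point reached: deductive closure found
        facts |= new
    return facts
```

A production engine adds indexing, unification over point variables, and beam-limited search, but the saturation structure is the same.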

This engine played a pivotal role in the system's success at IMO 2025, contributing to 5 out of 6 formalized problem solutions.

6. Applications, Implications, and Future Outlook

Seed-Prover exemplifies a scalable framework for automated mathematical reasoning, especially on complex, structured proof domains such as international mathematical olympiads and undergraduate contests. Its modular, lemma-oriented architecture facilitates interpretable and reparable proof strategies, while chain-of-thought and RL mechanisms enable robust learning from both correct and incorrect attempts.

The explicit integration of formal verification feedback is essential to both model robustness and supervision efficiency, as natural language–only signals (chains-of-thought without formal execution) are insufficient for stable optimization at this difficulty level.

The architecture’s lemma pool, self-summarization, and staged refinement are crucial for maintaining progress on problems where intermediate results are rare or require exploration outside standard templates. This architectural modularity is extensible: future versions can incorporate additional domain-specialized engines (analogous to Seed-Geometry), further extending the regime of problems addressable by fully automated theorem provers.

Seed-Prover’s success suggests a trajectory toward AI systems capable of discovery, formalization, and verification of results beyond the reach of traditional mathematical automation. The lemma-style framework and multi-stage feedback paradigms are likely to influence further developments in mathematical AI, proof engineering, and interactive verification systems.