
Self-play Theorem Prover (STP) Overview

Updated 21 August 2025
  • Self-play Theorem Prover (STP) is an automated framework that generates and refines mathematical conjectures and proofs through self-interaction and reinforcement learning.
  • It employs an iterative loop combining conjecture generation and proof verification, using metrics like empirical pass rate to maintain challenging yet solvable problem sets.
  • Leveraging methods such as Monte Carlo Tree Search and tool-integrated feedback, STP enhances proof accuracy and sample efficiency while reducing reliance on human-curated data.

A Self-play Theorem Prover (STP) is an automated theorem proving system that autonomously improves its reasoning capabilities by generating, solving, and iteratively refining mathematical problems through self-interaction. The central principle is that the prover both produces new conjectures and learns to prove them, constructing a closed feedback loop where its own outputs become its primary source of training data. Rooted in reinforcement learning and expert iteration paradigms, STP frameworks draw direct inspiration from mathematical practice, harnessing the duality of problem invention and proof discovery to address the data-scarcity and reward-sparsity problems endemic to supervised formal reasoning.

1. Foundational Principles and Design

The STP system is defined by its ability to integrate conjecturing (problem generation) and proving (solution synthesis) into a single LLM-driven process (Dong et al., 31 Jan 2025). This is operationalized via two roles:

  • Conjecturer: Generates new mathematical statements (conjectures), seeded either by existing theorems or by proofs already known to the system (illustrated in the Lean sketch after this list).
  • Prover: Attempts to prove both dataset-originated and freshly generated conjectures.
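
To make the conjecturer's role concrete, here is a hypothetical illustration in Lean 4 with Mathlib: a human-written seed theorem, and a variant statement a conjecturer might propose by unfolding its algebraic structure. The theorem names and statements are invented for illustration and are not taken from the STP paper.

```lean
import Mathlib

-- Hypothetical seed theorem (human-written):
theorem seed_sq_nonneg (a b : ℝ) : 0 ≤ (a + b) ^ 2 :=
  sq_nonneg (a + b)

-- A variant the conjecturer might propose by expanding the square;
-- the prover must then find the proof below on its own:
theorem conjectured_variant (a b : ℝ) : 0 ≤ a ^ 2 + 2 * a * b + b ^ 2 := by
  have h : a ^ 2 + 2 * a * b + b ^ 2 = (a + b) ^ 2 := by ring
  rw [h]
  exact sq_nonneg (a + b)
```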

The interaction is governed by an empirical pass rate metric, ℙ̂(c):

$$\hat{\mathbb{P}}(c) = \frac{\#\{\, i : c_i = c,\ p_{c,i}\ \text{correct} \,\}}{\#\{\, i : c_i = c \,\}}$$

where $c_i$ is the conjecture targeted by the $i$-th proof attempt and $p_{c,i}$ is that attempt.

Only conjectures with a pass rate in a challenging but feasible interval (e.g., $(0, 1/4]$), and passing additional qualitative filters (“elegancy”), are recycled to train the conjecturer. This mechanism ensures the curriculum always delivers problems at the edge of current prover capability, sustaining meaningful learning.
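
This filtering step is easy to state in code. A minimal Python sketch, assuming proof attempts have already been judged correct or incorrect by a verifier (the function names and the elegancy hook are illustrative, not the paper's interface):

```python
from collections import defaultdict

def empirical_pass_rates(attempts):
    """Estimate P̂(c) from an iterable of (conjecture, is_correct) pairs."""
    total, correct = defaultdict(int), defaultdict(int)
    for conjecture, is_correct in attempts:
        total[conjecture] += 1
        correct[conjecture] += int(is_correct)
    return {c: correct[c] / total[c] for c in total}

def select_training_conjectures(attempts, low=0.0, high=0.25,
                                is_elegant=lambda c: True):
    """Keep 'barely provable' conjectures: pass rate in (low, high]
    and passing the qualitative elegancy filter."""
    rates = empirical_pass_rates(attempts)
    return [c for c, r in rates.items() if low < r <= high and is_elegant(c)]
```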

2. Algorithmic Loop: Iterative Conjecturing and Proving

STP proceeds in discrete phases:

  1. Seed Selection: The input is either a human-written theorem or a previously solved statement, together with its proof and, typically, a supporting lemma.
  2. Conjecture Generation: The conjecturer proposes new related statements, often by manipulating problem structures or exploring near-misses from the proof.
  3. Proof Sampling: The prover, using expert iteration or reinforcement learning, tries to prove all available statements, performing multiple independent proof attempts per conjecture.
  4. Reward Assignment: Conjectures are scored by empirical pass rate. Those close to "barely provable" are returned as dense training signals; others are filtered out.
  5. Model Update: Successes and failures update both the conjecturer (to propose more nuanced problems) and prover (to better handle difficult statements).

This loop simulates the creative-discovery cycle in mathematical research and directly mitigates the sparse reward bottleneck common in formal reasoning, allowing rapid accretion of both skill and knowledge.
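
The following Python sketch condenses one round of this loop; the conjecturer, prover, and verifier objects and their method names are placeholders for the underlying LLM and proof-checker calls, not the paper's actual interfaces:

```python
def self_play_round(conjecturer, prover, verifier, seed_statements,
                    k=16, low=0.0, high=0.25):
    """One STP iteration: conjecture, prove, score, update."""
    # Phases 1-2: seed selection and conjecture generation.
    conjectures = [c for seed in seed_statements
                   for c in conjecturer.generate(seed)]
    statements = seed_statements + conjectures

    # Phase 3: k independent proof attempts per statement, each verified.
    results = {s: [verifier.check(s, p) for p in prover.sample(s, n=k)]
               for s in statements}

    # Phase 4: reward assignment via empirical pass rate; keep only
    # "barely provable" conjectures as conjecturer training signals.
    pass_rate = {s: sum(r) / len(r) for s, r in results.items()}
    kept = [c for c in conjectures if low < pass_rate[c] <= high]

    # Phase 5: verified proofs train the prover; well-calibrated
    # conjectures train the conjecturer.
    prover.update([s for s, r in results.items() if any(r)])
    conjecturer.update(kept)
    return pass_rate
```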

3. Reinforcement Learning and Expert Iteration

STP leverages RL and expert iteration techniques, aligning with strategies employed in prior provers but extending them with a dynamic, task-self-generating approach (Kaliszyk et al., 2018, Wu et al., 2021, Lample et al., 2022).

  • Monte Carlo Tree Search (MCTS) Integration: Early systems formulate theorem proving as sequential decision-making, where search branches are explored via MCTS and guided by learned policy and value networks.
  • MDP and Backtracking: In interactive settings (e.g., TacticZero), the proof state is formalized as a Markov decision process. The inclusion of stateful backtracking allows efficient escape from dead-end branches.
  • Online Self-play: Later systems (e.g., HTPS) implement distributed asynchronous self-play, where multiple agents concurrently prove statements and collectively update policy and critic networks via backpropagation through hypertree structures.

These methods combine exploitation of promising moves with exploration of novel strategies, enabling STP to generalize from synthetic data to real-world mathematical libraries.
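
The exploitation-exploration balance in these searches is typically governed by a UCT-style selection rule. A generic textbook sketch in Python (not any specific prover's implementation; child nodes are assumed to track visit counts and accumulated value):

```python
import math

def uct_select(children, c_explore=1.4):
    """Pick the child proof state maximizing mean value (exploitation)
    plus an exploration bonus that shrinks with repeated visits."""
    total_visits = sum(child.visits for child in children)

    def uct(child):
        if child.visits == 0:
            return float("inf")  # always expand unvisited branches first
        exploit = child.value_sum / child.visits
        explore = c_explore * math.sqrt(math.log(total_visits) / child.visits)
        return exploit + explore

    return max(children, key=uct)
```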

4. Curriculum Generation and Data Efficiency

A distinguishing feature of STP is automated curriculum generation:

  • Synthetic Data Synthesis: Provers accelerate training by generating synthetic problems that reflect increasing problem difficulty and structural diversity (Firoiu et al., 2021). For example, forward proposers controlled by clause size softmax sampling yield nontrivial theorems.
  • Feedback-driven SFT and RL: Conservative reward assignment, e.g., via TP-as-a-Judge (Leang et al., 18 Feb 2025) or direct theorem prover execution feedback as in StepFun-Prover (Shang et al., 27 Jul 2025), replaces expensive human annotation, meaning models can scale training with minimal need for externally curated datasets.
  • Efficiency Metrics: Pass rates are often reported as pass@k; e.g., STP achieves 65.0% (pass@3200) on miniF2F-test and 23.9% on ProofNet-test (Dong et al., 31 Jan 2025). Tool-integrated provers (StepFun-Prover) reach 70.0% pass@1 on miniF2F-test, signifying high sample efficiency (Shang et al., 27 Jul 2025).

The result is robust transfer from synthetic to human-written problems, a crucial milestone for the applicability of self-play approaches under real-world data constraints.
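
The pass@k figures above are commonly computed with the standard unbiased estimator: given $n$ sampled proofs of which $c$ verify, $\operatorname{pass}@k = 1 - \binom{n-c}{k}/\binom{n}{k}$. A direct Python implementation (this formula is standard benchmark practice; the STP paper's exact evaluation harness is not reproduced here):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts with c correct,
    is a verified proof."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct proof
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3200 attempts, 5 verified proofs.
# pass_at_k(3200, 5, 3200) == 1.0; pass_at_k(3200, 5, 1) ≈ 0.00156
```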

5. Comparison to Conventional and Tree Search Provers

Expert iteration and tree search-based methods previously dominated the space, but they plateau due to reward sparsity and over-reliance on static datasets.

  • Expert Iteration Limitations: For prior LLM-based provers, up to 98.5% of generated proofs are incorrect, and pass rates plateau as the available pool of correct proofs saturates (Dong et al., 31 Jan 2025). STP, by generating new training signals at the cusp of provability, achieves a twofold increase in solve rates (e.g., 28.5% on LeanWorkbook vs 13.2% for expert iteration).
  • Whole-Proof Generation Advantages: In contrast to tree search architectures (which rely on step-level generation and explicit value-function evaluation), STP’s whole-proof generation produces denser, more informative learning signals.

These findings underline that adaptive, self-generated curricula are necessary for overcoming the inherent glass ceiling of conventional formal reasoning pipelines.

6. Integration with Formal Reasoning Tools and Verification

Recent STP-inspired systems tightly couple LLM reasoning with formal verifiers:

  • Tool-Integrated Reasoning: Models interact with formal proof environments (e.g., the Lean 4 REPL in StepFun-Prover), generating proofs fragment by fragment, receiving real-time feedback, and updating reasoning accordingly (Shang et al., 27 Jul 2025).
  • Autoformalization and Step-by-Step Verification: Sentence-level decomposition (StepProof (Hu et al., 12 Jun 2025)) and iterative autoformalization (TP-as-a-Judge (Leang et al., 18 Feb 2025)) further increase proof accuracy and robustness, localizing errors for granular correction and efficiently filtering synthetic data.
  • Performance Impact: StepFun-Prover demonstrates the effectiveness of RL-based tool interaction, reporting 70% pass@1 with strong sample efficiency.

These integrations bridge natural language reasoning and machine-verifiable logic, narrowing the distance between human-level and machine-level formalization.
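
As a sketch of what tool-integrated verification can look like in practice, the Python fragment below drives a Lean 4 checker as a subprocess. The command line and JSON schema are assumptions modeled loosely on the community Lean REPL and would need adapting to whichever tool a given system actually uses:

```python
import json
import subprocess

def check_proof(lean_source: str, repl_cmd=("lake", "exe", "repl")) -> bool:
    """Send a candidate proof to a Lean 4 REPL process and report
    whether it elaborates without errors. The command and JSON field
    names are assumptions; adapt to your verifier's protocol."""
    proc = subprocess.Popen(repl_cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    request = json.dumps({"cmd": lean_source})
    reply_text, _ = proc.communicate(request + "\n", timeout=60)
    reply = json.loads(reply_text.splitlines()[0])
    errors = [m for m in reply.get("messages", [])
              if m.get("severity") == "error"]
    return not errors
```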

7. Future Directions and Open Challenges

STP frameworks establish a foundation for continued progress in automated mathematical reasoning:

  • Refinement of reward functions and conjecture elegancy metrics to further enhance curriculum quality and diversity.
  • Scalability beyond current benchmarks to richer formal languages (e.g., expansion from Lean and Isabelle to other verification environments).
  • Combining whole-proof generation with stepwise tree search architectures for hybrid solving.
  • Addressing overfitting concerns and generalization—the adaptive curriculum must avoid reinforcing idiosyncratic proof styles not present in wider mathematical practice.
  • Enhancing the robustness of tool-integrated feedback mechanisms to handle complex dependencies in multi-step proofs.

A plausible implication is that STP methodologies will enable provers to autonomously improve without continuous human data curation, shaping the next generation of AI-powered formal proof assistants.


STP represents a paradigm shift in formal theorem proving: autonomous systems invent, prove, and refine mathematical problems in a closed loop, guided by reinforcement learning and tool-integrated feedback, with demonstrably superior performance on standard proof benchmarks.