StepFun-Prover: Tool-Integrated Lean Prover

Updated 26 August 2025
  • StepFun-Prover is a system that integrates Lean 4’s formal feedback with large language model reasoning to iteratively generate and verify proofs.
  • It employs a tool-integrated reinforcement learning framework that dynamically refines proof strategies via real-time REPL feedback and error recovery.
  • The system achieves a pass@1 rate of 70.0% on the miniF2F-test benchmark, advancing interactive formal verification and Math AI capabilities.

StepFun-Prover is a modern automated theorem proving system that integrates tool-based formal verification with LLM–driven problem solving. It is designed to interactively generate and check formal Lean 4 proofs by leveraging a tightly coupled reasoning pipeline. The key innovation is its ability to “think” step-by-step while receiving real-time feedback from a formal proof assistant, iteratively refining its output in a closed tool–model–environment loop. StepFun-Prover achieves strong pass@1 rates on established formal mathematics benchmarks and introduces an end-to-end training architecture for Math AI assistants capable of tool-integrated reasoning.

1. Tool-Integrated Reasoning Framework

StepFun-Prover positions LLM reasoning at the center of an interactive formal verification workflow. The core mechanism interleaves natural language thought sketches and executable Lean 4 code segments within a single reasoning trajectory. For each proof attempt, the system

  • Produces a Lean 4 code block (enclosed in specialized tags, e.g., <sketch>...</sketch>)
  • Executes the code via a real-time Lean 4 REPL server, which returns feedback (success state or error/warning, annotated as <REPL>...</REPL>)
  • Integrates this feedback into subsequent reasoning steps

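To make this loop concrete, the following is a minimal Python sketch of the tool–model–environment cycle. The `model.generate_step` and `run_lean` calls, the tag-extraction helper, and the feedback fields are hypothetical stand-ins for illustration, not the released system's API.

```python
# Illustrative sketch of the tool-model-environment loop (all names hypothetical).
def prove(statement: str, model, run_lean, max_turns: int = 8):
    """Interleave model reasoning with Lean 4 REPL feedback until a proof checks."""
    transcript = f"Theorem to prove:\n{statement}\n"
    for _ in range(max_turns):
        step = model.generate_step(transcript)        # model emits reasoning + a <sketch> block
        transcript += step
        code = extract_between(step, "<sketch>", "</sketch>")
        if code is None:
            continue                                  # no executable sketch this turn
        feedback = run_lean(code)                     # hypothetical Lean 4 REPL client
        transcript += f"<REPL>{feedback.message}</REPL>"  # feedback re-enters the context
        if feedback.success:
            return code                               # verified proof found
    return None                                       # no verified proof within the budget


def extract_between(text: str, start: str, end: str):
    """Return the substring between the last start/end tag pair, if present."""
    i, j = text.rfind(start), text.rfind(end)
    return text[i + len(start):j] if i != -1 and j > i else None
```

Each sketch is executed as soon as it is emitted, and the verifier's response becomes part of the context for the next reasoning step, which is what distinguishes this loop from one-shot generation.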
This dialogic process is fundamentally different from static chain-of-thought generation: StepFun-Prover adapts its proof trajectory based on Lean’s execution feedback. As a result, error diagnosis and proof strategy selection can occur in situ—reminiscent of how human mathematicians iterate using pen, paper, and real-time checking.

Substantial engineering work underlies the Lean 4 backend, including asynchronous communication and concurrent session support, enabling scalable multi-turn REPL exchanges.
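The paper does not publish this backend; the following asyncio sketch only illustrates the general shape of concurrent, per-session Lean checking. The `lake env lean --stdin` invocation, the concurrency limit, and all names are assumptions, not the authors' implementation.

```python
import asyncio

MAX_CONCURRENT_SESSIONS = 64                 # illustrative limit, not the paper's setting
_slots = asyncio.Semaphore(MAX_CONCURRENT_SESSIONS)

async def check_sketch(lean_code: str, timeout_s: float = 60.0) -> str:
    """Run one Lean 4 snippet in its own process and return the compiler output."""
    async with _slots:                       # cap concurrency across sessions
        proc = await asyncio.create_subprocess_exec(
            "lake", "env", "lean", "--stdin",  # assumed invocation; adapt to your toolchain
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,
        )
        try:
            out, _ = await asyncio.wait_for(
                proc.communicate(lean_code.encode()), timeout=timeout_s
            )
            return out.decode()
        except asyncio.TimeoutError:
            proc.kill()
            return "error: Lean check timed out"

async def check_many(snippets: list[str]) -> list[str]:
    """Verify many sketches concurrently, one task per pending rollout turn."""
    return await asyncio.gather(*(check_sketch(s) for s in snippets))
```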

2. Reinforcement Learning and Training Protocol

The model is trained using a hybrid regime that incorporates multi-stage supervised fine-tuning (SFT) and reinforcement learning (RL) with tool-based interaction. The stages are:

  1. Cold-Start Phase: Synthesis of initial problem–solution data using mixed human and prior-model sources; basic formal syntax and proof logic are learned here.
  2. Supervised Fine-Tuning: First on curated formal mathematics datasets (e.g., Lean Workbook, STP), then further refinement using the synthesized cold-start set.
  3. Response Pattern Fusion: Harmonizes diverse output styles, including error-containing trajectories, by cross-training on them, blending the robustness of multiple reasoning styles into one response pattern.
  4. Tool-Based RL Rollouts: The model generates a mix of natural language and Lean 4 code; once the reasoning block closes (marked by </think>), the Lean 4 REPL is invoked to verify whether the resulting proof succeeds.

Reward is assigned on a binary basis (accepted/rejected), with group-based advantage normalization:

$A_i = \frac{r_i - \text{mean}(\{ r_1, r_2, \ldots, r_G \})}{\text{std}(\{ r_1, r_2, \ldots, r_G \})}$

$J(\theta) = \mathbb{E}_{(q,\, \{ o_i \})}\left[ \frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} \, A_i,\ \text{clip}\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\ 1-\epsilon,\ 1+\epsilon \right) A_i \right) \right]$

This PPO-like update (but without explicit KL regularization) rewards correct problem-solving trajectories as verified by external formal tools.
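The two displayed expressions translate directly into code. The following NumPy sketch shows the arithmetic only; it is not the authors' implementation, and the per-sequence likelihood ratios are assumed to be supplied from elsewhere in the training loop.

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """A_i = (r_i - mean(r)) / std(r), over one group of G rollouts for the same problem."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_objective(ratios: np.ndarray, advantages: np.ndarray, eps_clip: float = 0.2) -> float:
    """J = mean_i min(ratio_i * A_i, clip(ratio_i, 1 - eps, 1 + eps) * A_i)."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps_clip, 1.0 + eps_clip) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# Toy group of G = 4 rollouts: binary rewards from the Lean 4 verifier.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
ratios = np.array([1.05, 0.90, 1.30, 0.98])   # pi_theta / pi_theta_old per sequence
adv = group_advantages(rewards)
print(adv, clipped_objective(ratios, adv))
```

In practice the ratios come from token-level log-probabilities and the objective is maximized by gradient ascent; the sketch only mirrors the two formulas above.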

Subsequent cycles of RL rollout and SFT further focus the learning objective: only robust, error-sensitive proof attempts on hard problems are used for supervision, magnifying the system’s tool-based generalization.

3. Stepwise Proof Search and Error Recovery

A hallmark of StepFun-Prover is incremental, multi-phase proof construction. Rather than generating the complete formal proof in one pass (as in traditional static systems), it proceeds in discrete steps:

  • Each step outputs a partial proof or tactic invocation given the current state
  • Lean feedback guides correction or advancement
  • Encountered errors prompt local backtracking and revision, not global proof regeneration

This stepwise approach is supported by explicit reasoning tokens and environmental annotations, allowing the system to maintain a “proof stack” of partially verified solutions. It aligns closely with results from sentence-level autoformalization systems (Hu et al., 12 Jun 2025), which demonstrate superior efficiency and robustness for granular verification over whole-proof strategies.
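As a toy illustration of local repair (not an example from the paper), consider a one-line Lean 4 proof in which a failing tactic is replaced without regenerating anything else:

```lean
-- Toy Lean 4 example: a failing step is revised locally, guided by REPL output.
theorem toy (a b : Nat) : a + b = b + a := by
  -- A first sketch tried `rfl`; the REPL reports that the two sides are not
  -- definitionally equal, so only this step is replaced with a library lemma.
  exact Nat.add_comm a b
```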

4. Performance Benchmarks and Metrics

On the miniF2F-test benchmark—a collection of Olympiad-level Lean formalization challenges—StepFun-Prover attains a pass@1 rate of 70.0%. Pass@1 measures the fraction of problems solved correctly on the first attempt, with no resampling. Relative to previous tool-integrated or static provers (e.g., DeepSeek-Prover-V2-671B at 61.9%, Kimina-Prover-72B at 63.9%), this efficiency is high: typical systems require a large number of attempts and extensive sampling to achieve comparable rates.

$\text{Pass@1} = \frac{\text{\# correct on first try}}{\text{\# total test problems}}$

This sample efficiency is attributed to the reinforcement learning regime, dynamic feedback integration, and stepwise error recovery.
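As a worked example, assuming the standard 244-problem miniF2F-test split, a 70.0% pass@1 corresponds to roughly $0.700 \times 244 \approx 171$ problems verified on the first sampled attempt.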

5. Tool-Based Training Pipeline and Data Integration

The end-to-end training pipeline begins with synthetic and human-generated data, aggregated using multi-turn tool-verified dialogues. Cold-start datasets are iteratively refined by integrating new proof attempts, especially focusing on hard problems that were initially unsolvable. This form of curriculum learning—filtering for robust response trajectories that show meaningful interaction with Lean feedback—enables concentrated improvement where the model is weakest.

Training exploits multi-modal data (natural language and formal code), explicit execution traces, and environment feedback tags, and supports distributed reasoning modes. The REPL backend is optimized for high concurrency and low latency, supporting rapid multi-turn evaluation.
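A rough sketch of the kind of trajectory filtering described above follows; the selection criteria and field names are illustrative assumptions, not the authors' exact rules.

```python
# Hypothetical rollout records: keep verified attempts on previously unsolved
# problems whose reasoning actually used REPL feedback before succeeding.
def select_for_sft(trajectories: list[dict], solved_before: set[str]) -> list[dict]:
    keep = []
    for t in trajectories:
        hard = t["problem_id"] not in solved_before   # focus on hard problems
        verified = t["final_proof_verified"]          # Lean accepted the final proof
        interactive = t["num_repl_turns"] >= 1        # feedback was consulted, not ignored
        if hard and verified and interactive:
            keep.append(t)
    return keep
```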

6. Impact and Future Directions

StepFun-Prover establishes a prototype for Math AI assistants capable of end-to-end tool-integrated reasoning. The system demonstrates substantial efficiency gains by moving from rigid, static chain-of-thought approaches to dynamic, feedback-driven methods. The iterative training framework, integration of formal verification tools, and high pass@1 rates suggest that similar architectures may drive advances in:

  • Large-scale theorem library formalization (by lowering the barrier for automated formalization of complex math)
  • Interactive mathematical tutoring and education (via stepwise error feedback and correction)
  • Generalizable Math AI agents (by exporting the tool-integration paradigm to other formal systems beyond Lean)

Broader adoption may eventually lower technical entry barriers for formal methods practitioners and mathematical researchers.

7. Limitations and Ongoing Challenges

While StepFun-Prover achieves sample-efficient, high-precision results, several constraints remain. Stepwise proof generation still depends on reliable feedback from external tools; improvements in Lean 4 REPL robustness and error trace interpretation are critical. The reinforcement learning protocol is highly sensitive to reward sparsity—binary accept/reject signals—potentially limiting nuanced learning in ambiguous cases. Curriculum filtering and tool backend optimizations reduce training bottlenecks, but scaling to diverse domains and handling highly non-linear proofs remain active areas of investigation.

A plausible implication is that further integration with autoformalizers—such as StepFun-Formalizer (Wu et al., 6 Aug 2025)—could link informal-to-formal translation directly into the stepwise verified reasoning pipeline, advancing the prospects of universally accessible end-to-end mathematical AI systems.