BFS-Prover-V2: Scalable Theorem Prover
- BFS-Prover-V2 is a scalable system that integrates multi-turn off-policy reinforcement learning with hierarchical tree search for formal proofs and long-horizon reasoning.
- It employs tactic-level curriculum filtering based on perplexity and periodic retraining to enhance model refinement and prevent overfitting.
- The architecture uses a planner to decompose theorems and parallel prover agents with a shared subgoal cache, achieving state-of-the-art results on the MiniF2F and ProofNet benchmarks.
BFS-Prover-V2 is a scalable automatic theorem proving system that integrates multi-turn off-policy reinforcement learning (RL) with a planner-enhanced multi-agent tree search architecture to address the dual challenges of scaling training-time RL and inference-time computation for LLM-based step-level provers (Xin et al., 8 Sep 2025). The system builds upon prior work in BFS-based tree search for formal mathematics and introduces several algorithmic and architectural advances that yield state-of-the-art results on formal proof benchmarks and offer broader applicability for long-horizon multi-turn reasoning in other domains.
1. Multi-Turn Off-Policy Reinforcement Learning Framework
BFS-Prover-V2 treats stepwise theorem proving as a Markov Decision Process (MDP), wherein each "state" encodes the current Lean tactic state (hypotheses, goals), and each "action" is a tactic string generated by the LLM. A deterministic transition function (as defined by the Lean compiler) updates the state, and the reward is $1$ if the (state, tactic) pair lies on a successful proof path and $0$ otherwise. The policy is trained to maximize expected cumulative reward over multi-turn episodes.
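A minimal sketch of this interface, assuming a hypothetical `lean` wrapper exposing an `apply_tactic` call (names here are illustrative, not the authors' implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TacticState:
    """Snapshot of the Lean tactic state (pretty-printed hypotheses and goals)."""
    pretty: str
    goals_remaining: int  # 0 means the proof is complete

def step(lean, state: TacticState, tactic: str):
    """One MDP transition: apply an LLM-generated tactic via the Lean compiler.

    The transition is deterministic. A terminal reward of 1 is earned when the
    last goal closes; credit is then assigned to every (state, tactic) pair on
    the successful path, and 0 everywhere else.
    """
    result = lean.apply_tactic(state.pretty, tactic)  # hypothetical compiler API
    if result.error:                                  # tactic rejected by Lean
        return state, 0.0, False
    nxt = TacticState(result.pretty, result.goals_remaining)
    done = nxt.goals_remaining == 0
    return nxt, (1.0 if done else 0.0), done
```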
The RL framework iteratively alternates between:
- Proof Generation: The current LLM expert explores proof paths on a corpus of 3 million autoformalized problems using BFS-based tree search, collecting successful (state, tactic) pairs.
- Model Refinement: These pairs are used to further train the LLM via supervised fine-tuning, effectively performing expert iteration (a minimal sketch of the loop follows the list).
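The sketch below is hedged: the callables `search`, `perplexity`, and `fine_tune` stand in for components the paper describes (BFS tree search, tactic scoring, supervised fine-tuning) without committing to their actual APIs, and the threshold defaults are placeholders:

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (pretty-printed tactic state, tactic string)

def expert_iteration(
    search: Callable[[str], List[Pair]],      # BFS proof search for one theorem
    perplexity: Callable[[Pair], float],      # tactic perplexity under current model
    fine_tune: Callable[[List[Pair]], None],  # SFT on the filtered pairs
    theorems: Iterable[str],
    tau_low: float = 1.5,                     # placeholder thresholds; not the
    tau_high: float = 8.0,                    # paper's reported values
    rounds: int = 4,
) -> None:
    """Hedged sketch of BFS-Prover-V2's generate/filter/retrain loop."""
    for _ in range(rounds):
        pairs: List[Pair] = []
        for thm in theorems:
            pairs.extend(search(thm))  # empty list when the proof attempt fails
        # Tactic-level "Goldilocks" filter: drop steps the model already finds
        # trivial (low perplexity) and steps that are likely noise (high).
        kept = [p for p in pairs if tau_low <= perplexity(p) <= tau_high]
        fine_tune(kept)
```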
Adaptive Tactic-Level Data Filtering
Unlike previous approaches that filter entire proofs, BFS-Prover-V2 filters at the tactic level, using the model's own uncertainty as measured by tactic perplexity. The empirical, Gaussian-like perplexity distribution is divided into three regions; training focuses on tactics in the "Goldilocks" window of intermediate perplexity, discarding those that are either too easy or too noisy. Formally, a (state, tactic) pair $(s, a)$ is retained iff

$$\tau_{\text{low}} \;\le\; \mathrm{PPL}(a \mid s) = \exp\!\left(-\frac{1}{|a|}\sum_{t=1}^{|a|}\log p_\theta(a_t \mid s, a_{<t})\right) \;\le\; \tau_{\text{high}},$$

where $p_\theta$ is the model likelihood over tactic tokens and $\tau_{\text{low}}, \tau_{\text{high}}$ are thresholds chosen from the empirical perplexity distribution.
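A self-contained sketch of this filter, computing perplexity from per-token log-probabilities (the threshold values below are placeholders, not the paper's settings):

```python
import math
from typing import List, Sequence

def tactic_perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one generated tactic from its per-token log-probabilities
    (as returned by any LLM inference API that exposes logprobs)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def goldilocks_filter(
    examples: List[dict], tau_low: float, tau_high: float
) -> List[dict]:
    """Keep only (state, tactic) examples in the intermediate-perplexity window;
    each example dict is assumed to carry a 'token_logprobs' list."""
    return [
        ex for ex in examples
        if tau_low <= tactic_perplexity(ex["token_logprobs"]) <= tau_high
    ]

# Example: a trivially easy tactic (near-zero surprisal) is discarded.
easy = {"tactic": "rfl", "token_logprobs": [-0.01]}
mid  = {"tactic": "nlinarith [sq_nonneg x]", "token_logprobs": [-1.2, -0.8, -1.5]}
print([ex["tactic"] for ex in goldilocks_filter([easy, mid], 1.5, 8.0)])
# -> ['nlinarith [sq_nonneg x]']
```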
Periodic Retraining (“Soft Reset”)
To surmount expert iteration plateaus—often caused by over-specialized proof styles—the system periodically resynthesizes and re-curates proofs from the entire problem set, aggressively filtering tactics before retraining from a base checkpoint. This rejuvenates the model and prevents entrenchment in local optima.
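One way to wire such a reset into the training loop is sketched below; the reset period and the helper callables are illustrative assumptions, not details from the paper:

```python
def maybe_soft_reset(round_idx: int, reset_period: int, base_checkpoint,
                     resynthesize, curate, train_from):
    """Periodic "soft reset" (illustrative): every `reset_period` rounds,
    regenerate proofs over the entire problem set with the current expert,
    aggressively filter the tactics, and retrain from the base checkpoint
    rather than continuing from the possibly over-specialized expert."""
    if round_idx == 0 or round_idx % reset_period != 0:
        return None                     # no reset this round
    fresh_proofs = resynthesize()       # full-corpus proof generation
    curated = curate(fresh_proofs)      # stricter tactic-level filtering
    return train_from(base_checkpoint, curated)
```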
2. Planner-Enhanced Multi-Agent Tree Search
At inference time, BFS-Prover-V2 employs a planner-enhanced hierarchical architecture:
- Planner LLM: decomposes a target theorem into intermediate subgoals, analogously to human proof planning.
- Prover Agents: teams of parallel prover agents coordinate through a shared subgoal cache, each focusing exclusively on its active subgoal.
- When a subgoal is proven by any agent, its result is cached and propagated so that redundant work is eliminated across agents.
- If an agent exhausts its search budget before succeeding, the system re-invokes the Planner to further break down difficult subgoals (dynamic replanning).
This design dramatically reduces the working search space, increases parallel efficiency, and enables hierarchical reasoning. The flow—initial planning, parallel proof, context augmentation, and dynamic replanning—echoes human mathematical workflow.
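A minimal control-flow sketch of this workflow, assuming hypothetical `planner`, `prover`, and `cache` interfaces (a matching cache sketch appears in Section 4); stitching subgoal proofs back into a complete Lean proof is elided:

```python
import concurrent.futures as cf

def prove_with_planner(theorem, planner, prover, cache, budget, max_workers=8):
    """Illustrative control flow for the planner-enhanced search (names are
    assumptions, not the paper's API): the planner decomposes the theorem,
    parallel prover agents attack unproven subgoals, results propagate through
    the shared cache, and stuck subgoals are re-decomposed (dynamic replanning)."""
    frontier = planner.decompose(theorem)              # initial subgoal plan
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            open_goals = [g for g in frontier if cache.lookup(g) is None]
            if not open_goals:
                break                                  # every subgoal is proven
            futures = {pool.submit(prover.search, g, budget): g
                       for g in open_goals if cache.try_claim(g)}
            frontier = []
            for fut in cf.as_completed(futures):
                goal, proof = futures[fut], fut.result()
                if proof is not None:
                    cache.mark_proven(goal, proof)     # shared with all agents
                else:
                    # budget exhausted: ask the planner for a finer split
                    frontier.extend(planner.decompose(goal))
    return cache  # assembly of cached subgoal proofs into a full proof elided
```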
3. Performance on Formal Benchmarks
BFS-Prover-V2 demonstrates substantial improvements on standard formal mathematics test sets:

| Benchmark | Score (%) | Notes |
|-----------|-----------|------------------------------|
| MiniF2F | 95.08 | + Planner yields 95.49% |
| ProofNet | 41.4 | Validation set |
These results establish state-of-the-art performance for incremental tactic-based approaches and outperform methods based on whole-proof generation or non-hierarchical search. The gap between the base and planner-enhanced scores highlights the effect of hierarchical decomposition and parallel agent collaboration.
4. Algorithmic Details and Data Structures
Stepwise Proving as MDP
Each interaction with Lean4 is modeled as a step in an MDP. The state space is the set of Lean tactic states, transitions are deterministic as defined by the Lean kernel, and the agent responds with tactic strings. For each MDP episode (proof attempt), the agent receives reward $1$ at each state-action pair that lies on a successful proof path and $0$ otherwise.
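As a toy illustration (not drawn from the paper), each tactic line in the following Lean4 proof is one MDP action, and the goal state Lean reports after it is the next state; all three (state, tactic) pairs lie on a successful path and would receive reward $1$:

```lean
-- Each tactic is one MDP action; Lean's goal display after it is the next state.
example (p q : Prop) (hp : p) (hq : q) : p ∧ q := by
  constructor   -- splits ⊢ p ∧ q into subgoals ⊢ p and ⊢ q
  · exact hp    -- closes the first subgoal
  · exact hq    -- closes the second subgoal; proof complete
```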
Tactic Perplexity-Based Curriculum
Filtering the training data by perplexity ensures the model refines its policy on informative tactics rather than overfitting to trivial steps or memorizing noisy ones.
Subgoal Cache for Parallel Proof Search
The subgoal cache is a central shared data structure for all agents, recording subgoal hashes and statuses (Pending, Proving, Proven), and enabling fast de-duplication and dynamic context tracking.
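A minimal thread-safe sketch of such a cache, consistent with the orchestration example in Section 2; the hashing scheme and method names are assumptions:

```python
import hashlib
import threading
from enum import Enum
from typing import Dict, Optional

class Status(Enum):
    PENDING = "Pending"
    PROVING = "Proving"
    PROVEN = "Proven"

class SubgoalCache:
    """Shared cache keyed by a hash of the pretty-printed Lean statement, so
    syntactically identical subgoals reached by different agents deduplicate."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status: Dict[str, Status] = {}
        self._proofs: Dict[str, str] = {}

    @staticmethod
    def key(goal: str) -> str:
        return hashlib.sha256(goal.encode("utf-8")).hexdigest()

    def try_claim(self, goal: str) -> bool:
        """Atomically claim a subgoal; False if another agent already has it."""
        k = self.key(goal)
        with self._lock:
            if self._status.get(k, Status.PENDING) is not Status.PENDING:
                return False
            self._status[k] = Status.PROVING
            return True

    def mark_proven(self, goal: str, proof: str) -> None:
        k = self.key(goal)
        with self._lock:
            self._status[k] = Status.PROVEN
            self._proofs[k] = proof

    def lookup(self, goal: str) -> Optional[str]:
        """Return the cached proof for a subgoal, or None if not yet proven."""
        return self._proofs.get(self.key(goal))
```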
5. Broader Implications and Domain Generalization
While developed for formal mathematics in Lean4, BFS-Prover-V2's dual scaling strategy—multi-turn off-policy RL combined with hierarchical planner/prover decomposition and parallel agent tree search—is broadly applicable to domains requiring:
- Long-horizon multi-turn reasoning (e.g., program synthesis with intermediate specifications, multi-step robotic planning, compositional natural language tasks).
- Large, branching discrete search spaces that benefit from dynamic curriculum construction and hierarchical decomposition.
The use of model-driven tactic uncertainty as a curriculum signal and periodic retraining are general RL techniques potentially valuable for sequential decision tasks in language modeling and planning.
BFS-Prover-V2's architecture, by separating high-level planning (subgoal decomposition) from low-level strategy generation (tactic selection), enhances both efficiency and explainability, facilitating scalable search and human-auditable reasoning. This structural separation is a plausible direction for future extensions in machine reasoning.
6. Comparative Assessment and Future Directions
BFS-Prover-V2 advances beyond BFS-Prover, MCTS-based search, and value-guided approaches by:
- Removing reliance on critic/value models at inference time.
- Leveraging distributed, parallel infrastructure and hierarchical decomposition.
- Integrating dynamic curriculum construction and periodic model resets.
Future avenues include:
- More refined exploitation/exploration tradeoffs, possibly through adaptive parameter schedules for tactic selection.
- Broadening curriculum design signals, e.g., incorporating tactic diversity or contextual novelty.
- Exploring modular integration of BFS with critic-based or MCTS search primitives for hybrid approaches.
- Adapting the system to domains outside of formal mathematics to validate generality.
7. Summary Table: Key Features of BFS-Prover-V2
| Feature | Description | Source |
|---|---|---|
| Multi-turn off-policy RL | Expert iteration with tactic-level filtering and soft resets | (Xin et al., 8 Sep 2025) |
| Hierarchical planner–prover | Planner decomposes into subgoals; agents prove in parallel | (Xin et al., 8 Sep 2025) |
| Shared subgoal cache | Deduplicates subgoal proofs and propagates successes across agents | (Xin et al., 8 Sep 2025) |
| State-of-the-art benchmarks | 95.08% (MiniF2F), 41.4% (ProofNet) | (Xin et al., 8 Sep 2025) |
| Domain generalization | Applicable to multi-turn planning and other reasoning tasks | (Xin et al., 8 Sep 2025) |
BFS-Prover-V2 thus combines RL curriculum design, hierarchical decomposition, and scalable parallel tree search, delivering highly competitive results in proof automation and providing a template for broader long-horizon reasoning systems.