Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving (2507.02726v1)

Published 3 Jul 2025 in cs.AI and cs.LG

Abstract: Reasoning remains a challenging task for LLMs, especially within the logically constrained environment of automated theorem proving (ATP), due to sparse rewards and the vast scale of proofs. These challenges are amplified in benchmarks like PutnamBench, which contains university-level problems requiring complex, multi-step reasoning. To address this, we introduce self-generated goal-conditioned MDPs (sG-MDPs), a new framework in which agents generate and pursue their subgoals based on the evolving proof state. Given this more structured generation of goals, the resulting problem becomes more amenable to search. We then apply Monte Carlo Tree Search (MCTS)-like algorithms to solve the sG-MDP, instantiating our approach in Bourbaki (7B), a modular system that can ensemble multiple 7B LLMs for subgoal generation and tactic synthesis. On PutnamBench, Bourbaki (7B) solves 26 problems, achieving new state-of-the-art results with models at this scale.

Summary

  • The paper presents a novel sG-MDP framework that dynamically generates subgoals to provide denser rewards during automated theorem proving.
  • It employs an MCTS-inspired algorithm with an ensemble of 7B LLMs and Lean verification to improve exploration and proof search efficiency.
  • The system achieves state-of-the-art results on the PutnamBench dataset, demonstrating enhanced success rates and sample efficiency over previous 7B-scale models.

Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving

The paper "Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving" (2507.02726) presents a novel approach to automated theorem proving (ATP) by introducing the self-generated goal-conditioned Markov Decision Process (sG-MDP) framework. This framework enables agents to dynamically generate and pursue subgoals during proof search, addressing the challenges of sparse rewards and the combinatorial complexity inherent in formal mathematical reasoning, particularly on benchmarks such as PutnamBench.

Problem Context and Motivation

Automated theorem proving in formal systems like Lean, Isabelle, and Coq is a longstanding challenge in AI, requiring multi-step, logically coherent reasoning. While LLMs have demonstrated some capacity for mathematical reasoning, their performance is limited by the sparse reward landscape of theorem proving: feedback is typically only available upon completion of a full proof, making it difficult to guide exploration or learn effective heuristics. This issue is especially pronounced in complex benchmarks like PutnamBench, which features university-level problems with long proof horizons.

Traditional approaches, including symbolic methods and neural-guided search, have made progress but struggle with the combinatorial explosion of possible proof paths. Recent work has explored reinforcement learning (RL), expert iteration, and Monte Carlo Tree Search (MCTS) to improve search efficiency, but these methods often rely on predefined subgoals or require extensive training data and compute.

Self-Generated Goal-Conditioned MDPs

The core contribution of the paper is the formalization of sG-MDPs, which extend standard goal-conditioned MDPs by allowing the agent to generate its own subgoals dynamically, conditioned on the evolving proof state. In this setting:

  • States represent proof states (e.g., Lean proof environments).
  • Actions include both primitive proof tactics and the creation of new subgoals (conjectures).
  • Goals are not fixed in advance but are generated on-the-fly by the agent, reflecting the human-like strategy of decomposing complex proofs into manageable intermediate steps.
  • Transition and reward functions are adapted to account for both progress toward the main theorem and the independent verification of subgoals.
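
One compact way to write this extension down (a sketch only; the notation is not taken from the paper) is as a goal-conditioned MDP whose action set is augmented with subgoal-creation actions and whose goal set depends on the current state:

```latex
% Illustrative notation: A_goal (subgoal-creation actions) and the state-dependent
% goal set G(s) are assumptions introduced for exposition, not the paper's symbols.
\mathcal{M}_{\mathrm{sG}} \;=\; \bigl(\mathcal{S},\;\; \mathcal{A} \cup \mathcal{A}_{\mathrm{goal}},\;\; \mathcal{G}(s),\;\; T,\;\; R\bigr)
```

Here, G(s) collects the conjectures the agent has generated up to state s, so the set of goals being pursued grows as the proof attempt unfolds.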

This formulation enables a denser reward structure, as agents receive feedback not only for completing the overall proof but also for successfully proving intermediate conjectures. The sG-MDP framework is instantiated in Lean using the PyPantograph interface, which supports dynamic subgoal creation and validation.
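
As a minimal illustration of this reward shaping, the sketch below assumes hypothetical action types and an arbitrary weighting `lam`; none of these names or values come from the paper.

```python
# Minimal sketch of a subgoal-crediting reward; names and weighting are illustrative assumptions.
from dataclasses import dataclass
from typing import Union

@dataclass
class Tactic:
    text: str            # a primitive proof step, e.g. a Lean tactic string

@dataclass
class Conjecture:
    statement: str       # a self-generated subgoal, stated as a proposition to prove

Action = Union[Tactic, Conjecture]

def reward(main_theorem_closed: bool, newly_verified_subgoals: int, lam: float = 0.1) -> float:
    """Credit the terminal goal and every independently verified subgoal."""
    return float(main_theorem_closed) + lam * newly_verified_subgoals

# One verified intermediate conjecture already yields partial credit before the proof is done.
print(reward(main_theorem_closed=False, newly_verified_subgoals=1))  # 0.1
```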

Monte Carlo Tree Search with sG-MDPs

To solve the sG-MDP, the authors employ an MCTS-like algorithm, where:

  • Selection uses UCB to balance exploration and exploitation over proof states and subgoals.
  • Expansion queries the policy model (an ensemble of 7B LLMs) for tactic and subgoal candidates, validated in Lean.
  • Estimation assigns value to nodes based on the number of solved conjectures and proof depth, providing denser feedback than terminal-only rewards.
  • Back-propagation updates visit counts and accumulated values along the search path.

This approach does not rely on pretrained critics or value functions, instead leveraging direct verification in Lean for reward estimation. The modularity of the framework allows for ensembling multiple LLMs, increasing robustness and diversity in proof search.
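
The sketch below shows the overall shape of such a search loop under stated assumptions: `propose_actions` stands in for querying the 7B LLM ensemble, `verify_in_lean` stands in for validating a candidate through a Lean interface such as PyPantograph, and the scoring in `estimate` (solved conjectures minus a depth penalty) is an illustrative heuristic rather than the paper's exact rule.

```python
# MCTS-style search over proof states with self-generated subgoals.
# All helper functions are stubs standing in for the LLM ensemble and the Lean verifier.
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                      # serialized proof state (e.g. pretty-printed Lean goals)
    solved_subgoals: int = 0        # conjectures verified along the path to this node
    depth: int = 0
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

def ucb(node: Node, c: float = 1.4) -> float:
    """Selection: Upper Confidence Bound score for a child node."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def propose_actions(state: str) -> list[str]:
    """Stub for the policy: an ensemble of 7B LLMs proposing tactic / conjecture candidates."""
    return [f"candidate_{i}" for i in range(3)]

def verify_in_lean(state: str, action: str) -> tuple[str, bool]:
    """Stub for applying the candidate in Lean and reporting whether a subgoal was closed."""
    return f"{state} ; {action}", random.random() < 0.3

def estimate(node: Node) -> float:
    """Denser-than-terminal feedback: credit solved conjectures, lightly discount depth."""
    return node.solved_subgoals - 0.01 * node.depth

def search(root: Node, iterations: int = 100) -> Node:
    for _ in range(iterations):
        # Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: query the policy and validate each candidate in the verifier.
        for action in propose_actions(node.state):
            new_state, closed = verify_in_lean(node.state, action)
            node.children.append(Node(state=new_state,
                                      solved_subgoals=node.solved_subgoals + int(closed),
                                      depth=node.depth + 1,
                                      parent=node))
        # Estimation and back-propagation along the selected path.
        leaf = random.choice(node.children)
        score = estimate(leaf)
        while leaf is not None:
            leaf.visits += 1
            leaf.value += score
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits)

best_child = search(Node(state="⊢ main_theorem"))
```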

Experimental Results

The Bourbaki system, instantiated with an ensemble of DeepSeek-Prover-V2-7B and Kimina-7B, is evaluated on the PutnamBench dataset. Key results include:

  • State-of-the-art performance for 7B models: Bourbaki solves 26 out of 658 problems at pass@512, surpassing previous 7B-scale models such as Kimina-7B (10/644) and DeepSeek-Prover-V2 (23/658 at pass@1024).
  • Improved sample efficiency: Bourbaki achieves higher success rates with fewer samples compared to baselines, and discovers proofs not found by the base models even at higher sample budgets.
  • Enhanced proof diversity: The system generates a greater variety of correct proofs per theorem, indicating more effective exploration of the proof space.
  • Generalizability: The sG-MDP framework improves performance when layered on top of other provers (e.g., STP, DeepSeek-Prover-V2), demonstrating its utility as a general mechanism for structured exploration.

Implications and Future Directions

The introduction of sG-MDPs represents a significant step toward more human-like, structured reasoning in automated theorem proving. By enabling dynamic subgoal generation and leveraging intermediate conjectures for denser reward signals, the approach addresses key limitations of prior methods in sparse-reward environments.

Practical implications include:

  • Improved ATP systems: The framework can be integrated with existing proof assistants and LLM-based provers, enhancing their ability to tackle complex, multi-step problems with limited compute.
  • Modular ensembling: The ability to ensemble multiple LLMs within the sG-MDP framework allows for leveraging diverse model strengths and behaviors, potentially improving robustness and generalization.
  • Sample efficiency: The denser reward structure and structured exploration can reduce the computational resources required for proof search, making ATP more accessible.

Theoretical implications and future research directions:

  • Generalization to other domains: The sG-MDP framework could be applied to other structured reasoning tasks with sparse rewards and hierarchical objectives, such as program synthesis or formal verification.
  • Integration with learned value functions: While the current implementation eschews pretrained critics, future work could explore hybrid approaches that combine Lean-verifier feedback with learned value estimators for further efficiency gains.
  • Scaling to larger models and datasets: As LLMs continue to scale, integrating sG-MDPs with larger models and more diverse mathematical corpora may yield further improvements in ATP capabilities.

Conclusion

The Bourbaki system demonstrates that self-generated goal conditioning, combined with MCTS and LLM ensembling, can substantially advance the state of the art in automated theorem proving at moderate model scales. The sG-MDP framework provides a principled foundation for structured exploration and reward shaping in formal reasoning tasks, with promising implications for both practical ATP systems and the broader study of machine reasoning.
