- The paper introduces a self-generated, goal-conditioned MDP framework that dynamically creates subgoals to overcome sparse rewards in theorem proving.
- It integrates multiple 7B-parameter LLMs with Monte Carlo Tree Search and Lean verification to enhance proof search efficiency and diversity.
- The approach establishes new benchmarks on PutnamBench by demonstrating improved success rates and modular adaptability in automated theorem proving.
Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving
The paper presents Bourbaki, a modular automated theorem proving (ATP) system that introduces the self-generated goal-conditioned Markov Decision Process (sG-MDP) framework for formal mathematical reasoning. The work addresses the persistent challenge of sparse rewards and long proof horizons in ATP, particularly in the context of university-level problems as formalized in the PutnamBench dataset. The authors propose a principled approach that enables agents to dynamically generate and pursue subgoals, thereby structuring the proof search and providing denser intermediate feedback.
Technical Contributions
The central technical innovation is the formalization of sG-MDPs, which generalize standard goal-conditioned MDPs by allowing the agent to generate subgoals on-the-fly, conditioned on the evolving proof state. This is operationalized in the context of Lean theorem proving, where subgoals correspond to valid Lean statements (conjectures) that must themselves be formally proven. The sG-MDP framework augments the action space to include both primitive proof actions (tactics) and subgoal-creation actions, and defines a transition and reward structure that supports intermediate verification of conjectures.
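As a rough illustration of the augmented action space and subgoal-aware reward (not the paper's formal definitions), the two action kinds and a dense reward could be sketched in Python; the `subgoal_bonus` weight is a purely hypothetical choice:

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Tactic:
    """A primitive proof action, e.g. a Lean tactic string."""
    text: str

@dataclass(frozen=True)
class Conjecture:
    """A subgoal-creation action: a Lean statement (conjecture)
    that must itself be formally proven."""
    statement: str

# The sG-MDP action space contains both kinds of action.
Action = Union[Tactic, Conjecture]

def reward(verified_subgoals: int, proof_complete: bool,
           subgoal_bonus: float = 0.1) -> float:
    """Dense reward: credit every independently verified conjecture,
    plus a terminal reward for closing the original goal.
    The bonus weight is illustrative, not taken from the paper."""
    return subgoal_bonus * verified_subgoals + (1.0 if proof_complete else 0.0)
```

Under this shaping, an agent that proves two conjectures on the way to a failed proof still receives nonzero credit, which is the mechanism that densifies the otherwise sparse terminal signal.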
Bourbaki instantiates this framework by integrating multiple 7B-parameter LLMs (specifically, DeepSeek-Prover-v2-7B and Kimina-7B) for subgoal generation and tactic synthesis. The system leverages Monte Carlo Tree Search (MCTS) to explore the space of proof trajectories, using Lean’s formal verifier (via Pantograph) to validate both tactics and conjectures at each node. The reward function is augmented to provide credit for independently verified subgoals, not just completed proofs, which mitigates the sparse-reward problem endemic to ATP.
Empirical Results
Bourbaki is evaluated on the PutnamBench dataset, which comprises 658 formalized Putnam Competition problems in Lean 4. The system achieves a new state-of-the-art for 7B-scale models, solving 26 problems at pass@512, compared to the previous best of 23/658 by DeepSeek-Prover-v2-7B at pass@1024 and 10/644 by Kimina-7B. Notably, Bourbaki demonstrates improved sample efficiency, solving more problems with fewer queries, and generates a greater diversity of correct proofs per theorem. The system also consistently improves the performance of base models (e.g., STP and DeepSeek-Prover-v2) under fixed sample budgets, indicating the generality of the sG-MDP approach.
Implications and Discussion
The introduction of sG-MDPs represents a significant step in formalizing the process of dynamic subgoal generation in ATP. By enabling agents to structure their own search space through conjecture creation, the framework provides a mechanism for reward shaping that is both principled and compatible with formal verification. This approach aligns with human mathematical practice, where intermediate lemmas and conjectures are essential for managing complexity in long proofs.
From a practical perspective, the modularity of Bourbaki allows for straightforward integration of additional or improved LLMs, and the reliance on formal verification ensures soundness of generated proofs. The use of MCTS, guided by Lean-verifier feedback rather than learned critics, simplifies implementation and avoids the challenges associated with critic training in high-dimensional, sparse-reward environments.
The results on PutnamBench suggest that structured, subgoal-driven exploration is a key enabler for scaling ATP to more complex domains. The observed improvements in both proof success rates and proof diversity have direct implications for downstream applications in formal verification, mathematical discovery, and education.
Future Directions
Several avenues for further research are suggested by this work:
- Value Function Design: While the current implementation uses depth-based and conjecture-count metrics, integrating learned value functions or more sophisticated reward shaping could further enhance search efficiency.
- Model Scaling and Ensembling: Extending the framework to larger models or more diverse ensembles may yield further gains, especially as LLM capabilities continue to improve.
- Generalization to Other Formal Systems: The sG-MDP framework is agnostic to the underlying proof assistant and could be adapted to other formal languages (e.g., Coq, Isabelle).
- Automated Curriculum Generation: The dynamic subgoal mechanism could be leveraged for automated curriculum learning, where the system generates increasingly challenging conjectures to bootstrap its own capabilities.
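The depth-based and conjecture-count metrics mentioned under Value Function Design could be sketched as a simple node-scoring heuristic; the function name and both weights below are illustrative assumptions, not values from the paper:

```python
def node_priority(depth: int, verified_conjectures: int,
                  conjecture_weight: float = 0.2,
                  depth_penalty: float = 0.05) -> float:
    """Hypothetical search heuristic: reward progress through verified
    conjectures while discouraging needlessly deep proof branches.
    Weights are illustrative; a learned value function could replace this."""
    return conjecture_weight * verified_conjectures - depth_penalty * depth
```

A learned value function would replace this hand-tuned trade-off with estimates fit to observed proof outcomes, which is precisely the extension the authors leave open.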
Conclusion
Bourbaki demonstrates that self-generated, goal-conditioned MDPs provide a robust foundation for ATP in environments characterized by sparse rewards and long horizons. The empirical results establish new benchmarks for 7B-scale models on PutnamBench and highlight the importance of structured, subgoal-driven search in formal reasoning. The framework’s modularity and reliance on formal verification position it as a promising basis for future advances in automated mathematical reasoning and formal methods.