Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving (2507.02726v1)

Published 3 Jul 2025 in cs.AI and cs.LG

Abstract: Reasoning remains a challenging task for LLMs, especially within the logically constrained environment of automated theorem proving (ATP), due to sparse rewards and the vast scale of proofs. These challenges are amplified in benchmarks like PutnamBench, which contains university-level problems requiring complex, multi-step reasoning. To address this, we introduce self-generated goal-conditioned MDPs (sG-MDPs), a new framework in which agents generate and pursue their subgoals based on the evolving proof state. Given this more structured generation of goals, the resulting problem becomes more amenable to search. We then apply Monte Carlo Tree Search (MCTS)-like algorithms to solve the sG-MDP, instantiating our approach in Bourbaki (7B), a modular system that can ensemble multiple 7B LLMs for subgoal generation and tactic synthesis. On PutnamBench, Bourbaki (7B) solves 26 problems, achieving new state-of-the-art results with models at this scale.

Summary

  • The paper introduces a self-generated goal-conditioned MDP framework that dynamically creates subgoals to combat sparse rewards in theorem proving.
  • It employs ensembles of 7B-parameter language models and Monte Carlo Tree Search to improve search efficiency and proof diversity on challenging benchmarks.
  • The system integrates modularly with Lean, achieving new state-of-the-art results and demonstrating enhanced sample efficiency on the PutnamBench dataset.

Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving

The paper presents Bourbaki, a modular automated theorem proving (ATP) system that introduces the self-generated goal-conditioned Markov Decision Process (sG-MDP) framework for formal mathematical reasoning. The work addresses the persistent challenge of sparse rewards and long proof horizons in ATP, particularly in the context of university-level problems as formalized in the PutnamBench dataset. The authors propose a principled approach to subgoal generation and search, leveraging ensembles of 7B-parameter LLMs and Monte Carlo Tree Search (MCTS) to achieve new state-of-the-art results at this scale.

Problem Setting and Motivation

Automated theorem proving in formal systems such as Lean is a demanding testbed for machine reasoning, requiring multi-step logical inference and the ability to decompose complex problems. While LLMs have demonstrated some capacity for mathematical reasoning, their performance is limited by the combinatorial explosion of proof spaces and the sparsity of reward signals: most proof attempts yield no feedback until a complete proof is found. This is especially problematic in datasets like PutnamBench, which contains challenging, multi-step problems from the Putnam Mathematical Competition.

The authors identify that human mathematicians naturally decompose proofs into intermediate subgoals, providing denser feedback and structuring the search space. Existing goal-conditioned reinforcement learning (GCRL) frameworks are insufficient for this setting, as they assume a fixed set of goals. The need is for a system that can dynamically generate and pursue subgoals based on the evolving proof state.

Self-Generated Goal-Conditioned MDPs (sG-MDPs)

The core theoretical contribution is the formalization of sG-MDPs, which extend standard goal-conditioned MDPs by allowing the agent to generate its own subgoals during the search process. In this framework:

  • The action space is augmented to include both primitive proof actions (e.g., Lean tactics) and subgoal-generation actions (e.g., conjecturing intermediate lemmas).
  • The transition function is extended to operate on both proof states and the stack of active goals.
  • The reward function is designed to provide feedback not only for solving the main theorem but also for independently verifiable subgoals, enabling denser and more informative reward signals.

This formulation enables a smoother optimization landscape and supports more effective exploration strategies in the proof search space.
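
To make the formulation concrete, here is a minimal Python sketch of an sG-MDP state and its augmented action space, following the three bullet points above. All names (`SGState`, `Tactic`, `Conjecture`) and the reward weighting are illustrative assumptions, not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Tactic:
    """Primitive proof action, e.g. a Lean tactic string."""
    text: str

@dataclass(frozen=True)
class Conjecture:
    """Subgoal-generation action: conjecture an intermediate lemma to prove."""
    statement: str

Action = Union[Tactic, Conjecture]

@dataclass(frozen=True)
class SGState:
    """sG-MDP state: the Lean proof state plus the stack of active goals."""
    proof_state: str             # token-level representation of the proof state
    goal_stack: tuple[str, ...]  # active (sub)goals; top of stack is pursued first

def reward(solved_subgoals: int, main_theorem_solved: bool) -> float:
    """Dense reward crediting independently verified subgoals, not only the
    main theorem. The 0.1 weight is an illustrative choice."""
    return float(main_theorem_solved) + 0.1 * solved_subgoals
```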

The sG-MDP framework is instantiated in the Lean 4 proof assistant, with proof states represented as token sequences and actions corresponding to tactic applications or conjecture introductions. The system leverages PyPantograph for tactic validation and subgoal management, enabling dynamic manipulation of the proof state and subgoal stack.
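
A minimal sketch of this interaction, assuming the `Server.goal_start`/`goal_tactic` interface shown in the PyPantograph README; exact signatures vary across releases, so the calls below should be treated as assumptions and checked against the installed version.

```python
from pantograph import Server  # PyPantograph: Python interface to Lean 4

# Starts a Lean checking process; project/path arguments may be required
# depending on the PyPantograph release.
server = Server()

# Begin a proof from a goal and apply a single tactic. Method names follow
# the PyPantograph README and should be verified locally.
state0 = server.goal_start("forall (p q: Prop), Or p q -> Or q p")
state1 = server.goal_tactic(state0, goal_id=0, tactic="intro p q h")

# A rejected tactic surfaces as an error (a rejected action for the search);
# a state with no remaining goals is a Lean-verified (sub)proof.
print(state1.goals)
```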

Proof search is conducted using an MCTS variant tailored to the sG-MDP structure. The search tree nodes represent (state, goal stack) pairs, and the policy is conditioned on the current subgoal. The value function is computed using only verifiable outcomes (i.e., Lean-validated proofs of subgoals), eschewing learned critics. The reward structure can be flexibly designed to promote depth, number of solved conjectures, or other metrics.
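
The following is a compact sketch of such a search loop with standard UCT selection. The node structure mirrors the (state, goal stack) description above; `propose_actions`, `apply_action`, and `verified_value` are hypothetical stubs standing in for the LLM ensemble and the Lean verifier.

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # (proof state, goal stack) pair
        self.parent = parent
        self.children = []        # list of (action, Node) pairs
        self.visits = 0
        self.value_sum = 0.0      # accumulated Lean-verified reward only

    def uct(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def propose_actions(state):
    """Stub for the LLM ensemble: candidate tactics and conjectures."""
    return []

def apply_action(state, action):
    """Stub for stepping the proof state and goal stack via Pantograph."""
    return state

def verified_value(state):
    """Stub for the value function, computed from verifiable outcomes
    (Lean-validated subgoal proofs) with no learned critic."""
    return 0.0

def mcts(root, iterations=512, max_children=10):
    for _ in range(iterations):
        # 1. Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max((child for _, child in node.children),
                       key=lambda ch: ch.uct())
        # 2. Expansion: add up to max_children candidate actions.
        for action in propose_actions(node.state)[:max_children]:
            node.children.append(
                (action, Node(apply_action(node.state, action), node)))
        # 3. Evaluation and 4. backpropagation of the verified value.
        value = verified_value(node.state)
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
```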

System Architecture and Implementation

Bourbaki is implemented as a modular system capable of ensembling multiple LLMs for both subgoal generation and tactic synthesis. In the reported experiments, DeepSeek-Prover-v2-7B and Kimina-7B serve as the base models, with vLLM used for efficient batch inference. The system is integrated with Pantograph and Lean 4.20.1 for proof validation.
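
A sketch of batched candidate generation with vLLM. The HuggingFace model identifier and the prompt format are illustrative assumptions; in practice, each 7B ensemble member would typically run in its own process or on its own GPU.

```python
from vllm import LLM, SamplingParams

# Illustrative model id; check the repository actually used in the paper.
prover = LLM(model="deepseek-ai/DeepSeek-Prover-V2-7B")

# Sample up to 10 tactic candidates per goal, matching the per-node budget.
params = SamplingParams(temperature=0.8, max_tokens=256, n=10)

pending_goals = ["⊢ 2 + 2 = 4"]  # toy stand-in for goals popped from the stack
prompts = [f"Current Lean goal:\n{g}\nNext tactic:" for g in pending_goals]

outputs = prover.generate(prompts, params)  # one batched call per expansion
candidates = [[o.text for o in out.outputs] for out in outputs]
```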

Key implementation details include:

  • Up to 10 tactic candidates are considered per node, with a maximum of 512 MCTS iterations per problem.
  • The value function combines depth-based metrics and the number of solved conjectures, providing denser feedback than terminal-only rewards (a toy version is sketched after this list).
  • Heuristic tactics that yield multiple candidate actions are handled by deferring to the base model for completion, with ongoing work to improve tactic validation and proof soundness.
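
A toy version of that combined value signal follows; the paper's exact weights are not given here, so the coefficients are assumptions.

```python
def node_value(depth: int, max_depth: int,
               solved_conjectures: int, main_solved: bool) -> float:
    """Combine depth progress and verified-conjecture count into one scalar.
    The 0.5 weight and the depth normalization are illustrative."""
    depth_term = depth / max_depth              # credit for deeper verified progress
    conjecture_term = 0.5 * solved_conjectures  # credit per Lean-verified conjecture
    return (1.0 if main_solved else 0.0) + depth_term + conjecture_term
```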

Empirical Results

Bourbaki is evaluated on the PutnamBench dataset, which comprises 658 formalized Putnam problems in Lean 4. The main results are as follows:

  • Bourbaki (7B) solves 26/658 problems at pass@512, surpassing the previous best 7B result of 23/658 (DeepSeek-Prover-v2-7B at pass@1024) and the prior leaderboard best of 10/644 (Kimina-7B); see the note on pass@k after this list.
  • The system demonstrates improved sample efficiency, solving more problems with fewer model queries.
  • The approach also increases proof diversity, generating a greater variety of correct proofs per theorem compared to base models.
  • When applied as a wrapper to other provers (e.g., STP, DeepSeek-Prover-v2), the sG-MDP framework consistently improves success rates under fixed sample budgets.
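
A note on the metric: pass@k here denotes solving a problem within a budget of k sampled proof attempts. When pass@k is estimated from n >= k samples containing c successes, the standard unbiased estimator of Chen et al. (2021), 1 - C(n-c, k) / C(n, k), is commonly used; a numerically stable implementation (not from this paper) is:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed as a
    numerically stable product (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # fewer than k failures: every k-subset contains a success
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```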

The results substantiate the claim that structured, subgoal-driven exploration via sG-MDPs leads to more effective and efficient proof search in formal mathematics, even at moderate model scales.

Implications and Future Directions

The introduction of sG-MDPs provides a generalizable framework for integrating dynamic subgoal generation into ATP systems. The empirical gains in both success rate and proof diversity suggest that denser reward structures and structured exploration are critical for scaling formal reasoning with LLMs. The modularity of Bourbaki allows for straightforward integration of larger or more specialized models, and the framework is compatible with a range of search and value-estimation strategies.

Potential future developments include:

  • Scaling the approach to larger models and more diverse ensembles.
  • Integrating learned critics or value functions to further guide search.
  • Extending the framework to other formal systems beyond Lean.
  • Investigating curriculum learning and self-play within the sG-MDP paradigm.

The work advances the state of the art in LLM-based theorem proving and provides a foundation for further research into structured, goal-driven reasoning in AI. The sG-MDP framework is likely to inform future developments in both formal mathematics and broader domains requiring hierarchical, multi-step reasoning.
