Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean

Published 3 Jun 2026 in cs.CL and cs.LO | (2606.04883v1)

Abstract: LLMs are increasingly used in workflows for generating formal proofs in Lean. These workflows often decompose problems into smaller lemmas, sample many proof attempts, and use compiler feedback to guide search. However, they can be prohibitively expensive, often spending substantial compute on attempts that ultimately fail. In this work, we address this problem with an action routing agent that consists of a data plane and a control plane. The data plane generates natural-language lemma decompositions, formalizes them in Lean, and samples proof attempts for the resulting theorem and lemma targets. The control plane observes previous failed Lean attempts, estimates both the likelihood of success and cost of another attempt, and decides whether to continue proving the current target or restart from a new breakdown. On a subset of PutnamBench, our agent decreases the cost by $25.8\%$ over a fixed-step baseline on average, preserving performance while using substantially less compute. These results suggest that failed Lean trajectories provide actionable signals for cost-aware resource allocation in agentic theorem proving.


Authors (4)

Summary

The paper introduces a dynamic cost-quality control mechanism that adjusts compute allocation based on success likelihood.
It employs a modular, lemma-driven proof generation pipeline using trajectory-derived features for cost and accuracy estimation.
Empirical evaluations on PutnamBench show a 25.8% cost reduction at matched accuracy for the proposed agent; a zero-noise oracle router reaches up to 59.9%, indicating headroom for better estimators.

Optimizing the Cost-Quality Tradeoff in Agentic Lean Theorem Provers

Introduction

LLMs have significantly advanced automated formal theorem proving in systems such as Lean. However, state-of-the-art agentic theorem-proving frameworks, which recursively decompose problems and sample many attempts per subproblem, are computationally expensive and often spend much of their budget on trajectories with little prospect of success. The paper "Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean" (2606.04883) addresses the inefficiency of static, fixed-step allocation policies by proposing and empirically validating a routing-based control mechanism for agentic Lean provers. The mechanism allocates compute dynamically according to the estimated likelihood of finding a proof, inferred from the agent's history of proof attempts and the Lean compiler's feedback.

Formalization of the Agent Architecture

The authors introduce a modular agent architecture split into a proof-generation data plane and a cost-quality control plane. The data plane follows prior agentic prover frameworks: it hierarchically decomposes formal problems into lemmas, formalizes them, and recursively attempts proofs, checking each attempt directly against the Lean environment. The control plane runs alongside it, observing the history of failures and successes within a proof-attempt trajectory and extracting features such as proof similarity and error diversity to estimate both the expected cost and the probability of future success. At each routing juncture, the agent uses a cost-quality utility function

$\tau(s) = \hat{q}(s) - \lambda \hat{c}(s)$

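where $\hat{q}(s)$ is the predicted next-attempt solve rate, $\hat{c}(s)$ is the expected compute cost, and $\lambda$ sets the tradeoff between cost and quality. The agent uses $\tau(s)$ as a stopping rule: it decides whether to continue proving the current target or restart from a new breakdown.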
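As a minimal sketch of how such a utility could drive a routing decision, the Python below computes $\tau(s)$ from two estimates and compares it with a threshold. The `RoutingEstimate` type, the `route` function, the zero threshold, and the numeric values are illustrative assumptions; the summary does not state the exact decision rule the paper uses.

```python
from dataclasses import dataclass


@dataclass
class RoutingEstimate:
    q_hat: float  # predicted next-attempt solve rate, in [0, 1]
    c_hat: float  # expected compute cost of the next attempt (units assumed)


def utility(est: RoutingEstimate, lam: float) -> float:
    """tau(s) = q_hat(s) - lambda * c_hat(s)."""
    return est.q_hat - lam * est.c_hat


def route(est: RoutingEstimate, lam: float, threshold: float = 0.0) -> str:
    """Keep proving the current target while the utility exceeds the threshold.

    The threshold rule is an assumption made for illustration only.
    """
    return "continue" if utility(est, lam) > threshold else "restart"


# Made-up estimates: one promising trajectory, one that has stalled.
promising = RoutingEstimate(q_hat=0.30, c_hat=1.0)
stalled = RoutingEstimate(q_hat=0.02, c_hat=1.0)

print(route(promising, lam=0.1))  # continue
print(route(stalled, lam=0.1))    # restart
```

Raising `lam` makes the router more willing to abandon a target, which is how a single parameter can trace out a cost-quality curve.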

This design can be straightforwardly extended to richer agentic action spaces, subsuming actions such as model switching and self-correction, so long as suitable estimators for cost and utility are constructed per action.
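One way to picture this extension, assuming each action comes with its own estimator, is to score every action with the same utility and pick the best. The action names, the estimator signature, and the constants below are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, Tuple

# Each estimator maps a trajectory state to (q_hat, c_hat) for one action.
Estimator = Callable[[dict], Tuple[float, float]]


def select_action(state: dict, estimators: Dict[str, Estimator], lam: float) -> str:
    """Return the action with the highest q_hat - lam * c_hat."""

    def score(name: str) -> float:
        q_hat, c_hat = estimators[name](state)
        return q_hat - lam * c_hat

    return max(estimators, key=score)


# Hypothetical per-action estimators returning made-up constants.
estimators: Dict[str, Estimator] = {
    "continue": lambda s: (0.05, 1.0),
    "restart": lambda s: (0.20, 3.0),
    "switch_model": lambda s: (0.35, 8.0),
}

cheap_compute = select_action({}, estimators, lam=0.01)
costly_compute = select_action({}, estimators, lam=0.10)
print(cheap_compute, costly_compute)  # switch_model continue
```

With a small `lam` the most accurate but most expensive action wins; with a large `lam` the cheapest action wins, even though every utility is negative.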

Figure 1: Cost-quality curve comparison between the proposed agentic router, a fixed-step baseline, and a zero-noise oracle, demonstrating substantial resource savings at comparable accuracy.

Empirical Evaluation and Numerical Results

Experiments are conducted on an 85-problem subset of PutnamBench, using Goedel-Prover-V2-8B as the proof-generating backend. The adaptive routing agent is compared against a fixed-step baseline (fixed budget per subproblem) and an oracle agent with access to ground-truth success probability. The primary evaluation metric is SFLOPs, and the systems are benchmarked for both constant-performance and constant-budget regimes.

Key numerical results include:

At matched accuracy, the adaptive agent achieves a 25.8% reduction in compute cost relative to the fixed-step baseline.
For a fixed budget, the adaptive policy confers a 7.8% relative accuracy improvement.
With the proposed architecture, $43.6\%$ of problems are solved at $22.6$M SFLOPs, while the baseline requires $31.6$M SFLOPs to achieve $44.0\%$ accuracy—a 28.4% reduction in cost for negligible accuracy loss.
Comparative analysis against an oracle agent reveals further potential for gains: zero-noise oracle adaptive routing provides up to a 59.9% cost reduction.
Figure 3: Cost-quality curves revealing the operational efficiency of the agentic router compared to both fixed-step and oracle reference points.
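The $22.6$M versus $31.6$M SFLOPs comparison above can be rechecked with a few lines of arithmetic. The helper name `relative_reduction` is ours; the inputs are the rounded figures quoted in the text.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction of the baseline cost that is saved."""
    return (baseline - ours) / baseline


cost_cut = relative_reduction(baseline=31.6, ours=22.6)
accuracy_gap = 44.0 - 43.6  # percentage points

print(f"cost reduction: {cost_cut:.1%}")
print(f"accuracy gap: {accuracy_gap:.1f} points")
```

Recomputing from the rounded inputs gives roughly 28.5%, which agrees with the quoted 28.4% up to rounding of the SFLOPs values, for an accuracy gap of 0.4 percentage points.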

Ablation studies confirm the significance of trajectory-derived features, with error diversity being especially salient for next-attempt success estimation.
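To make the trajectory-derived features concrete, here is a rough sketch of how proof similarity, error diversity, and attempt count might be computed from a list of failed attempts. The definitions are assumptions chosen for illustration (mean pairwise `difflib` ratio for similarity, fraction of distinct error messages for diversity); the paper's actual feature definitions may differ, and the attempt strings are placeholders.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def trajectory_features(attempts: List[Tuple[str, str]]) -> Dict[str, float]:
    """Summarize failed attempts given as (proof_text, error_message) pairs."""
    proofs = [proof for proof, _ in attempts]
    errors = [error for _, error in attempts]

    # Assumed definition: mean pairwise textual similarity between proofs.
    pairs = list(combinations(proofs, 2))
    similarity = (
        sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
        if pairs
        else 0.0
    )

    # Assumed definition: share of attempts with a distinct error message.
    diversity = len(set(errors)) / len(errors) if errors else 0.0

    return {
        "attempt_count": float(len(attempts)),
        "proof_similarity": similarity,
        "error_diversity": diversity,
    }


# Placeholder attempts; real inputs would be Lean proofs and compiler errors.
features = trajectory_features([
    ("proof text A", "error 1"),
    ("proof text A", "error 1"),
    ("proof text B", "error 2"),
])
print(features)
```

A trajectory whose attempts are near-identical and keep hitting the same error would score high on similarity and low on diversity, which is the kind of signal a router could use to stop early.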

Analysis of Data Plane Architectural Impact

The agent's data plane, based on recursive lemma-driven decomposition rather than monolithic end-to-end proof generation, yields substantial performance gains even at a fixed inference budget. With the same 8B model, the lemma-driven pipeline doubles the solve rate of an 8B whole-proof baseline and nearly matches the accuracy of a much larger 32B-parameter model configured for whole-proof generation.
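The lemma-driven flow can be sketched as a loop over targets with a fixed number of attempts each, which corresponds to the fixed-step setting. This is a simplified, single-level version (no recursion), and the four callables (`decompose`, `formalize`, `sample_proof`, `lean_check`) are hypothetical stand-ins for the LLM and Lean components, not real APIs.

```python
from typing import Callable, List, Optional


def prove_with_lemmas(
    problem: str,
    decompose: Callable[[str], List[str]],    # problem -> natural-language lemmas
    formalize: Callable[[str], str],          # statement -> Lean target
    sample_proof: Callable[[str], str],       # Lean target -> candidate proof
    lean_check: Callable[[str, str], bool],   # (target, proof) -> verified?
    attempts_per_target: int = 4,
) -> Optional[List[str]]:
    """Return one verified proof per target, or None if any target fails."""
    lemmas = decompose(problem)
    targets = [formalize(lemma) for lemma in lemmas] + [formalize(problem)]

    proofs: List[str] = []
    for target in targets:
        for _ in range(attempts_per_target):
            candidate = sample_proof(target)
            if lean_check(target, candidate):
                proofs.append(candidate)
                break
        else:
            # Fixed budget exhausted for this target.
            return None
    return proofs


# Stub components so the sketch runs without an LLM or a Lean installation.
result = prove_with_lemmas(
    "p",
    decompose=lambda p: [p + "_l1", p + "_l2"],
    formalize=lambda s: "thm:" + s,
    sample_proof=lambda t: "proof_of:" + t,
    lean_check=lambda t, c: c == "proof_of:" + t,
)
print(result)
```

In the adaptive agent, the inner fixed-count loop is where a control plane would step in and decide, after each failure, whether another attempt is worth its cost.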

Figure 2: Comparison between fixed-step data plane pipeline and whole-proof generation, highlighting superior performance due to the modular, decompositional prover architecture.

Figure 4: Cost-quality curves for the whole-proof baseline on PutnamBench, shown for context; decompositional pipelines approach the accuracy region of significantly larger single-step models at lower compute cost.

Theoretical and Practical Implications

This work exposes the inefficiency of static compute-allocation strategies in agentic theorem-proving systems and demonstrates that policy-based, trajectory-aware routing yields substantial savings without degrading problem-solving performance. Furthermore, a simple proxy feature vector incorporating proof similarity, error diversity, and attempt count already captures a meaningful share of the benefit achievable by an oracle (a 25.8% cost reduction versus up to 59.9%). This suggests significant room for further optimization via richer trajectory representations and more expressive estimators.

Practically, this approach offers a way to reduce the prohibitive runtime cost of competitive formal reasoning agents ($40–50 per problem on PutnamBench for reported state-of-the-art systems), allowing for more scalable research, experimentation, and deployment.

Limitations and Prospective Future Directions

The study's main limitation is computational cost, which restricted empirical validation to an 85-problem subset of PutnamBench. Nonetheless, the agentic routing framework is general and admits several future extensions:

Learned cost/utility estimation with deep models rather than linear regression.
Generalization to broader agent action spaces, including hierarchical action selection, model switching, fine-tuned self-correction, and depth-adaptive recursion.
Application to other formal systems and mathematical domains, leveraging runtime verifiability as an information source for control.
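The first item contrasts deep models with linear regression. As a reference point for what a linear estimator involves, here is a one-feature ordinary least-squares fit written in plain Python. The feature, the training values, and the clipping to $[0, 1]$ are made-up illustrations, not the paper's estimator or data.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up training pairs: a trajectory feature and an observed solve frequency.
feature_values = [0.0, 0.25, 0.5, 0.75, 1.0]
solve_rates = [0.02, 0.06, 0.10, 0.14, 0.18]

slope, intercept = fit_line(feature_values, solve_rates)


def predict_q(feature: float) -> float:
    """Linear prediction clipped to a valid probability."""
    return min(1.0, max(0.0, intercept + slope * feature))


print(round(slope, 3), round(intercept, 3), round(predict_q(0.5), 3))
```

A deep or otherwise nonlinear estimator would replace `fit_line` and `predict_q` while leaving the routing logic that consumes $\hat{q}$ and $\hat{c}$ unchanged.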

The explicit gap to oracle routing underscores the need for stronger, possibly learned trajectory embeddings and exploration of unsupervised or semi-supervised quality estimation.

Conclusion

The paper demonstrates that cost-quality-aware routing in agentic Lean theorem provers significantly reduces inference compute at minimal or no loss in success rate. The decoupled agent architecture and trajectory-derived estimation framework are validated empirically and can be extended to richer action spaces, providing a basis for more scalable and efficient agentic mathematical reasoning systems. The results suggest that adaptive control, grounded in observed trajectory statistics, is a strong default design for cost-sensitive formal reasoning agents.
