OProver: A Unified Framework for Agentic Formal Theorem Proving

Published 17 May 2026 in cs.CL and cs.AI | (2605.17283v1)

Abstract: Recent progress in formal theorem proving has benefited from large-scale proof generation and verifier-aware training, but agentic proving is rarely integrated into prover training, appearing only at inference time. We present OProver, a unified framework for agentic formal theorem proving in Lean 4, in which failed proof attempts are iteratively revised using retrieved compiler-verified proofs and Lean compiler feedback. OProver is trained through continued pretraining followed by iterative post-training: each iteration runs agentic proving, indexes newly verified proofs into OProofs and the retrieval memory, uses repair trajectories as SFT data, and uses unresolved hard cases for RL. OProofs is built from public Lean resources, large-scale proof synthesis, and agentic proving traces, containing 1.77M Lean statements, 6.86M compiler-verified proofs, and serialized trajectories with retrieved context, failed attempts, feedback, and repairs. Across five benchmarks, OProver-32B attains the best Pass@32 on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), and ranks second on MathOlympiad (22.8%) and ProofNet (33.2%), giving it more top placements than any prior open-weight whole-proof prover.

Authors (10)

Summary

The paper introduces an agentic proving policy that repairs failed proofs over multiple rounds, conditioning on retrieved verified proofs and compiler feedback, to improve Lean theorem proving performance.
It constructs a trajectory-level Lean corpus (OProofs) and employs iterative post-training to integrate newly verified proofs with repair signals.
Empirical evaluations across multiple benchmarks demonstrate that combining retrieval and detailed compiler feedback leads to notable improvements in Pass@32 metrics.

Introduction

OProver introduces an integrated agentic formal theorem proving framework for Lean 4 that fundamentally reconfigures both data construction and training procedures. The system unifies retrieval, compiler feedback, and iterative proof repair, aiming to mitigate the train-inference divergence afflicting prior approaches, where retrieval and environment feedback are mostly deployed as inference-time heuristics rather than incorporated into the prover's policy during training. The central contribution is the agentic proving policy trained end-to-end to execute bounded multi-round refinement, leveraging retrieved compiler-verified proofs and Lean diagnostics at each iteration. This approach constructs OProofs, a large-scale trajectory-level Lean corpus, and couples model performance with evolving data quality through iterative post-training and corpus recirculation.

Unified Agentic Proving Framework

OProver formalizes agentic proving as a multi-round loop wherein the policy $\pi$ conditions on the target statement $s$, retrieval memory $\mathcal{R}_t$ (top-$k$ relevant compiler-verified proofs), previous proof attempt $p_{t-1}$, and Lean feedback $f_{t-1}$. At each round, the policy generates a revised proof attempt, which is verified by Lean. The process terminates upon successful verification or exhaustion of the round budget. Critically, the interaction state is kept compact, conditioning only on the most recent attempt and feedback, facilitating targeted repair and preventing excessive prompt growth. Retrieval memory is indexed from OProofs and expanded continuously during post-training.
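The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `agentic_prove`, `Round`, and the three callables (`retrieve`, `policy`, `verify`) are hypothetical names, and re-querying the retrieval memory on every round is an assumption, since the text does not say how often $\mathcal{R}_t$ is refreshed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Round:
    attempt: str    # proof text produced in this round
    feedback: str   # raw verifier output for the attempt ("" on success)
    verified: bool  # whether the verifier accepted the attempt


def agentic_prove(
    statement: str,
    retrieve: Callable[[str], List[str]],
    policy: Callable[[str, List[str], Optional[str], Optional[str]], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int,
) -> List[Round]:
    """Run bounded multi-round refinement and return the per-round trace."""
    prev_attempt: Optional[str] = None   # p_{t-1}
    prev_feedback: Optional[str] = None  # f_{t-1}
    trace: List[Round] = []
    for _ in range(max_rounds):
        retrieved = retrieve(statement)  # R_t (assumed to be re-queried each round)
        attempt = policy(statement, retrieved, prev_attempt, prev_feedback)
        ok, feedback = verify(statement, attempt)
        trace.append(Round(attempt, feedback, ok))
        if ok:
            break  # stop at the first verified proof
        # Compact state: only the latest attempt and feedback are carried forward.
        prev_attempt, prev_feedback = attempt, feedback
    return trace
```

Because only `prev_attempt` and `prev_feedback` survive between rounds, the prompt size stays roughly constant regardless of how many rounds have already failed.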

Figure 1: Structural overview of OProver, illustrating OProofs Construction, agentic proving with retrieval/feedback, and agentic training with CPT, SFT, RL, and data recirculation.

Compiler feedback is provided in raw textual form rather than categorical abstraction, ensuring fine-grained failure diagnostics are utilized directly for proof revision. Retrieval augmentation exposes reusable lemma and tactic structures from related contexts without rigid template constraints, increasing proof flexibility. This synergy between feedback and retrieval is integral to OProver's robustness across diverse theorem types.
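To make the "raw textual feedback" point concrete, the following sketch assembles a model input from the four conditioning signals. The function name `build_prompt` and the section headers are invented for illustration; the actual serialization format used by OProver is not given in this summary.

```python
from typing import List, Optional


def build_prompt(
    statement: str,
    retrieved: List[str],
    prev_attempt: Optional[str],
    prev_feedback: Optional[str],
) -> str:
    """Serialize the interaction state into one prompt (hypothetical layout)."""
    parts = ["### Target statement", statement]
    for i, proof in enumerate(retrieved, start=1):
        parts += [f"### Retrieved verified proof {i}", proof]
    if prev_attempt is not None:
        parts += ["### Previous attempt", prev_attempt]
    if prev_feedback is not None:
        # The compiler output is passed through verbatim, not mapped to an error category.
        parts += ["### Lean feedback (raw)", prev_feedback]
    return "\n\n".join(parts)
```

In the first round both `prev_attempt` and `prev_feedback` are `None`, so the prompt contains only the statement and the retrieved proofs.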

OProofs: Corpus Construction and Characteristics

OProofs is a trajectory-level Lean corpus that goes beyond traditional datasets restricted to final verified proofs. The corpus comprises 1.77M unique Lean statements, 6.86M compiler-verified proofs, 1.07M agentic proving trajectories, and over 280K round-level supervised repair instances. Data is sourced from public Lean resources, large-scale proof synthesis (including autoformalization and compiler verification), and ongoing agentic proving traces. Each trajectory records retrieved context, failed attempts, feedback, and subsequent repairs, supporting both supervised and RL objectives. Domain and difficulty annotation is performed via LLM classification, covering Algebra, Analysis, Number Theory, Geometry, and supporting domains; difficulty tiers span Elementary to GraduatePlus.
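One way to picture a trajectory record and the round-level repair instances derived from it is the sketch below. All class and field names are hypothetical, and the extraction rule (keep a failed round only when the very next round verifies) is an assumption; the summary does not specify which rounds qualify as supervised repair instances.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RoundRecord:
    attempt: str
    feedback: str   # raw compiler output for this attempt
    verified: bool


@dataclass
class Trajectory:
    statement: str
    retrieved: List[str]                       # retrieved context
    rounds: List[RoundRecord] = field(default_factory=list)


@dataclass
class RepairInstance:
    statement: str
    retrieved: List[str]
    failed_attempt: str
    feedback: str
    repaired_proof: str


def repair_instances(traj: Trajectory) -> List[RepairInstance]:
    """Pair each failed round with the next round when that next round verifies."""
    out: List[RepairInstance] = []
    for prev, nxt in zip(traj.rounds, traj.rounds[1:]):
        if not prev.verified and nxt.verified:
            out.append(RepairInstance(
                statement=traj.statement,
                retrieved=traj.retrieved,
                failed_attempt=prev.attempt,
                feedback=prev.feedback,
                repaired_proof=nxt.attempt,
            ))
    return out
```

Each `RepairInstance` has the same inputs the policy sees at inference time (statement, retrieved proofs, last attempt, last feedback), with the repaired proof as the target.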

Figure 2: Corpus overview summarizing unique statements, proofs, trajectory repair instances, domain distribution, and difficulty levels.

Crucially, OProofs is not static. As OProver advances, newly verified proofs and repair traces are continuously assimilated, building data recirculation into the iterative post-training pipeline. This co-evolution paradigm is essential for sustaining improvements in both retrieval context and supervision signal.

Training Pipeline: Continued Pretraining and Iterative Post-Training

OProver is initialized via continued pretraining (CPT) on a 65B-token mixture emphasizing Lean formal data (from OProofs), code corpora, mathematical reasoning datasets, and chain-of-thought expansions, resulting in a domain-adapted OProver-Base. Iterative post-training alternates agentic proving rollouts, supervised fine-tuning (SFT) on repair trajectories, and RL via Group Sequence Policy Optimization (GSPO) with group-relative normalization. Newly verified proofs enhance OProofs and the retrieval memory, with harder unresolved cases feeding RL signal.
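The data routing within one post-training iteration can be sketched as follows. This is an illustrative reading of the description above, not the authors' pipeline: `route_iteration` is a hypothetical helper, and it assumes each rollout stops at its first verified round, so a successful rollout ends with the verified proof.

```python
from typing import List, Tuple

# One rollout: (statement, [(attempt, verified), ...]) with rounds in order.
Rollout = Tuple[str, List[Tuple[str, bool]]]


def route_iteration(rollouts: List[Rollout]):
    """Split rollouts into corpus additions, SFT trajectories, and RL targets."""
    verified_proofs: List[Tuple[str, str]] = []  # goes to OProofs and the retrieval memory
    sft_trajectories: List[Rollout] = []         # repair trajectories for SFT
    rl_statements: List[str] = []                # unresolved hard cases for RL
    for statement, rounds in rollouts:
        if rounds and rounds[-1][1]:
            verified_proofs.append((statement, rounds[-1][0]))
            if len(rounds) > 1:
                # At least one failed round preceded the success: a repair trajectory.
                sft_trajectories.append((statement, rounds))
        else:
            rl_statements.append(statement)
    return verified_proofs, sft_trajectories, rl_statements
```

A statement solved in a single round still contributes a verified proof to the corpus, but it carries no repair signal and is therefore left out of the SFT set in this sketch.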

Supervised fine-tuning leverages explicit round-level repair supervision, strictly matching training and deployment interfaces. Reinforcement learning optimizes group-wise advantages, pooling rounds from multiple rollouts on each theorem, permitting cross-comparative credit assignment. The training interface and reward formulation are tuned for maximal compatibility with agentic interaction.
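Group-relative normalization can be illustrated with a small helper that standardizes rewards within each theorem's pooled set of rounds. The sketch covers only the advantage computation; GSPO's sequence-level importance ratio and clipping are not shown. The function name, the `eps` stabilizer, and the use of binary rewards in the usage note are assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_relative_advantages(
    rewards_by_theorem: Dict[str, List[float]], eps: float = 1e-6
) -> Dict[str, List[float]]:
    """Standardize rewards within each theorem's group of pooled rounds."""
    advantages: Dict[str, List[float]] = {}
    for theorem, rewards in rewards_by_theorem.items():
        mu = mean(rewards)
        sigma = pstdev(rewards)  # population std over the group
        advantages[theorem] = [(r - mu) / (sigma + eps) for r in rewards]
    return advantages
```

With rewards `[1.0, 0.0, 0.0, 1.0]` for one theorem, the successful rounds receive an advantage close to +1 and the failed ones close to -1. A group in which every round has the same reward yields all-zero advantages, so it contributes no gradient signal.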

Empirical Evaluation and Results

OProver is evaluated on five Lean 4 theorem-proving benchmarks: MiniF2F, MathOlympiadBench, ProofNet, ProverBench, and PutnamBench. Comparisons include state-of-the-art open-weight reasoning models (DeepSeek-V3.2, Kimi-K2.5) and open-weight formal provers (Goedel-Prover-V2, LongCat-Flash-Prover).

Figure 3: Pass@32 performance comparisons on MiniF2F-Test, PutnamBench, and MathOlympiadBench, with OProver achieving consistent state-of-the-art or near state-of-the-art results.

OProver-32B attains the highest Pass@32 on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), and ranks second on MathOlympiadBench (22.8%) and ProofNet (33.2%), outperforming models with significantly greater parameter counts (e.g., LongCat-Flash-Prover, DeepSeek-Prover-V2). The performance edge is not driven by model scale; instead, it arises from integrated retrieval, feedback-conditioned multi-round repair, and process-level post-training. Even OProver-8B surpasses Goedel-Prover-V2-32B, showing that the method remains effective at smaller model sizes.
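For readers unfamiliar with the metric, the sketch below implements the widely used unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, for a problem with $n$ sampled attempts of which $c$ verify. This summary does not state how the paper computes Pass@32, so treating it as this estimator is an assumption; with $n = k = 32$ it reduces to "at least one of 32 attempts verified".

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c successes."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of `pass_at_k` over all problems in the benchmark.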

OProver's capacity to benefit from increased test-time compute is confirmed by consistent improvements in BestPass at fixed budget $B$, with diminishing marginal returns as budget grows, and clear benchmark-dependent allocation optima between refinement depth and sampling width.

Figure 4: OProver test-time performance scaling under fixed compute budget $B$, showing monotonic gains with increased resources and diminishing returns at higher $B$.

Figure 5: Performance tradeoff between refinement depth $R$ and sampling width under fixed total budget, illustrating benchmark-specific allocation optima.
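The depth-versus-width tradeoff can be explored with a small search over allocations. The sketch assumes the total budget factorizes as $B = R \times W$, where $W$ is the number of independent samples and $R$ the number of refinement rounds per sample; the summary only says the total budget is fixed, so this accounting is an assumption. `allocations`, `best_allocation`, and the `evaluate` callback are hypothetical names.

```python
from typing import Callable, List, Tuple


def allocations(budget: int) -> List[Tuple[int, int]]:
    """All (depth R, width W) pairs with R * W == budget."""
    return [(r, budget // r) for r in range(1, budget + 1) if budget % r == 0]


def best_allocation(
    budget: int, evaluate: Callable[[int, int], float]
) -> Tuple[int, int]:
    """Return the (R, W) pair that maximizes a caller-supplied score."""
    return max(allocations(budget), key=lambda rw: evaluate(rw[0], rw[1]))
```

Here `evaluate(R, W)` stands for measuring the solve rate on a given benchmark with that allocation. Since the optimum is reported to be benchmark-specific, the search would be run separately per benchmark.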

Iterative post-training yields monotonic improvements per iteration, confirming that continued infusion of newly verified proofs and agentic repair traces directly augments proving capability.

Component Ablation: Feedback and Retrieval

Ablation studies show that multi-turn compiler feedback is the dominant driver of performance, with retrieval providing additional gains. Removing feedback consistently degrades Pass@32 by 5 to 8 points across benchmarks and model sizes; further removing retrieval yields smaller drops of roughly 1 to 2 points. The combined results confirm that OProver's improvement arises from the synergy of retrieval-augmented grounding and feedback-conditioned iterative repair, not single-round sampling.

Implications and Future Directions

OProver substantiates the theoretical and practical benefits of agentic proving, trajectory-level process supervision, and co-evolutionary training. The framework demonstrates scalability on formal mathematics benchmarks and establishes retrieval and feedback conditioning as essential for robust proof search and repair. On the theoretical side, OProofs serves as an archetype for process-level formal corpora, bridging the gap between static proof artifacts and dynamic interactive supervision.

Practically, OProver's unified policy, with recirculating agentic training loops, presents a scalable approach for enhancing formal theorem proving in Lean and other proof assistants. The methodology invites further study into abstraction-aware retrieval, multi-agent collaboration in agentic proving, and full-cycle formal-mathematical data synthesis pipelines. As iterated refinement and feedback integration mature, future AI systems may better approximate expert-level mathematical reasoning with verifier-grounded reliability.

Conclusion

OProver delivers a comprehensive agentic formal theorem proving framework for Lean 4, unifying retrieval, feedback, and iterative repair within training and inference. By constructing the OProofs corpus with trajectory-level supervision and enabling co-evolution between data and policy, OProver advances formal reasoning performance and benefits consistently from additional test-time compute. Empirical results validate the framework across diverse benchmarks, and ablation studies reinforce the importance of agentic refinement and retrieval grounding. The architecture and corpus establish a foundation for further advances in proof-oriented AI, formal verification, and automated mathematical reasoning.
