- The paper presents the miniF2F-Dafny benchmark, recasting theorem proving into auto-active verification by combining SMT-based automation with high-level LLM guidance.
- It demonstrates a 35% improvement over bare SMT automation, with top models achieving up to a 55.7% pass rate on challenging olympiad-level problems.
- The study highlights the potential of integrating LLM reasoning with automated verification to simplify formal proofs and lower barriers in mathematical formalization.
LLM-Guided Mathematical Theorem Proving via Auto-Active Verification: The miniF2F-Dafny Benchmark
Introduction and Context
"MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification" (2512.10187) introduces an automation-oriented adaptation of the established miniF2F mathematical reasoning benchmark, translating it from interactive theorem provers (ITPs) to the auto-active verifier Dafny. The motivation centers on exploring whether auto-active paradigms—where SMT solvers mechanize significant proof search—enable modern LLMs to make nontrivial contributions, particularly when the requirement is shifted from low-level, stepwise proof scripting to high-level mathematical insight and strategic hint generation.
Prior miniF2F instances for Lean, Isabelle, and HOL Light necessitate manual proof scripts for each logical step, either from humans or highly specialized AI agents. This work recasts miniF2F in Dafny to leverage SMT-backed automation, investigating the resultant division of labor between LLM-generated proof hints and automated logical reasoning, and establishing new baselines for automated mathematical formalization.
Benchmark Design and Technical Approach
The miniF2F-Dafny benchmark comprises 488 mathematical problems (test and validation sets of 244 problems each), translated systematically from Lean to Dafny’s input language. Each problem is a Dafny lemma with full signature, precisely specified preconditions (requires clauses), and postconditions (ensures clauses), but an empty proof body. The translation preserves mathematical diversity over algebra, combinatorics, number theory, inequalities, and analysis, reflecting the source benchmark’s composition (AIME, AMC, IMO, and undergraduate material).
To support SMT-automated verification, two foundational files are defined: definitions.dfy (axiomatizing integers, rationals, reals, complex numbers, and standard operations) and library.dfy (108 axiomatized lemmas spanning properties of exponentials, polynomials, trigonometric identities, complex analysis, number theory, and fundamental inequalities). The axiomatization is intentionally minimal, avoiding large-scale library development and focusing evaluation on proof synthesis and the interaction between SMT automation and LLM-generated guidance.
The evaluation pipeline prioritizes stringent adherence to original specifications—prohibiting unsound additions such as weakened postconditions, omission or modification of preconditions, or circumvention via assume statements or {:axiom} exploitation in solutions.
Baseline Results: Automation and LLM-Steered Proof Synthesis
SMT-Backed Automation
Running Dafny’s verifier (version 4.11.0, Z3 backend, 5 attempts, 30s timeout), 40.6% of the test set and 44.7% of the validation set (99/244 and 109/244 problems, respectively) are verified with empty proof bodies. This demonstrates that for a substantial fraction of olympiad-level mathematics, generic SMT-based automation—supported solely by minimal axiomatic infrastructure—can discharge all verification obligations with no human or AI input for proof construction. Notably, problems requiring multi-step, intricate proofs in Lean can collapse to the trivial case in Dafny.
LLM-Guided Proof Hints
On the remaining problems, 12 off-the-shelf LLMs are evaluated for their ability to fill the proof body with hints, assertions, intermediate lemmas, or algebraic calculations such that Dafny’s automation verifies the result. Each problem receives up to 4 generations per model, including up to 3 rounds of error-correction feedback based on verification diagnostics—a simulated iterative proof search process.
The top-performing model (Claude Sonnet 4.5) achieves a pass@4 of 55.7% on the test set—a 35% improvement over bare SMT automation. The next-best models are Claude Sonnet 3.7 (55.2%) and Qwen 3 235B Mixture-of-Experts (54.3%), with typical models clustering in the 43–50% range. Leaderboard results indicate that highly capable models, even when not fine-tuned on Dafny/theorem-proving data, can provide substantive mathematical guidance that interfaces productively with automated verification. Contrasting models (Llama, DeepSeek, GPT-OSS) often falter due to poor Dafny idiom familiarity or inability to exploit verification-oriented constructs such as calc statements or ghost variables.
Examples exhibit concise, effective LLM-generated reasoning (e.g., parity arguments, algebraic manipulations triggering SMT discharge) as well as more sophisticated proof strategies (e.g., auxiliary lemma synthesis, structured decompositions such as sum-of-squares). These outputs frequently resemble “explanatory outlines” rather than strictly formal script enumerations, which match the automation-centric setting.
Error Analysis
Three dominant sources of failure are identified:
- Verification Brittleness: Minor syntactic or structural divergences (e.g., assertion ordering, structure of proofs) can impede SMT automation, resulting in verification failures for semantically correct yet verifier-unfriendly scripts.
- Insufficient Language-Specific Data: LLMs often conflate Dafny with other ITP syntaxes, produce semantically vacuous idioms, or misuse verification primitives due to low exposure in pretraining corpora.
- Mathematical Scope Limitations: Problems requiring facts or lemmas outside the minimal library hit unavoidable obstacles, necessitating creative theory extension within the solution proof itself.
Relation to Prior Work
The work draws a clear distinction from the dominant orientation of prior theorem-proving benchmarks, which are grounded in interactive theorem provers and necessitate granular proof term construction. Recent advances (Seed-Prover [seed-prover], HILBERT [hilbert], Aristotle (Achim et al., 1 Oct 2025)) deliver >99% accuracy on Lean miniF2F via ensemble agentic frameworks, massive formal corpora pretraining, and elaborate search and verification constructs. By contrast, miniF2F-Dafny is the first to instantiate the benchmark in an auto-active verification language, shifting the focus from proof search to proof synthesis under SMT-backed automation, thereby redefining the interface between LLMs and formal reasoning engines.
The evaluation also contrasts with prior attempts at program synthesis in Dafny (Poesia et al., 2024, Loughridge et al., 2024), where success rates have historically lagged, and underscores the crucial impact of verification pipeline soundness on benchmark integrity.
Implications and Future Directions
The clear demonstration of division of labor—LLMs outlining high-level insight, SMT solvers handling low-level detail—suggests that auto-active paradigms can lower the barrier to practical AI-augmented formalization in mathematics. This has direct implications for mathematical knowledge management, code and proof co-verification, and the integration of AI guidance within design-by-contract workflows.
Near-term research avenues include:
- Dafny-specific Pretraining and Fine-Tuning: To bridge idiomatic knowledge gaps, dedicated pretraining and reinforcement learning on verification artifacts will likely yield improvements.
- Agentic Proof Search: Adapting agent-based orchestrators (as in Seed-Prover/HILBERT) to the auto-active context, potentially fusing dynamic lemma synthesis and auto-caching, may close the remaining gap with ITP performance.
- Automated Lemma Induction and Retrieval: Leveraging strategies from neural proof retrieval (LEGO-Prover (Wang et al., 2023)) and lemma mining may enable cumulative learning over the benchmark.
- Benchmark Library Expansion: Extending the axiomatized mathematical corpus would allow meaningful evaluation at higher mathematical sophistication and diminish bottlenecks from unprovable obligations.
Beyond these, the synergies between auto-active verification and interactive proving may drive architectural convergence, as evidenced by trends such as Lean’s integration of automation-focused grind tactics.
Conclusion
The miniF2F-Dafny benchmark operationalizes the first rigorous evaluation of pure mathematical reasoning under the auto-active paradigm, demonstrating a productive interface between LLMs and SMT-powered automation. The benchmark’s structure incentivizes concise, high-level mathematical guidance from LLMs, delegating routine derivational work to the verifier. This architecture highlights the practical potential for broader adoption of AI-assisted formalization in mathematics and verification, and sets a foundation for future research bridging the gap between interactive and automation-centric paradigms.