Papers
Topics
Authors
Recent
2000 character limit reached

MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification

Published 11 Dec 2025 in cs.LG | (2512.10187v1)

Abstract: We present miniF2F-Dafny, the first translation of the mathematical reasoning benchmark miniF2F to an automated theorem prover: Dafny. Previously, the benchmark existed only in interactive theorem provers (Lean, Isabelle, HOL Light, Metamath). We find that Dafny's automation verifies 99/244 (40.6%) of the test set and 109/244 (44.7%) of the validation set with empty proofs--requiring no manual proof steps. For problems where empty proofs fail, we evaluate 12 off-the-shelf LLMs on providing proof hints. The best model we test achieves 55.7% pass@4 success rate employing iterative error correction. These preliminary results highlight an effective division of labor: LLMs provide high-level guidance while automation handles low-level details. Our benchmark can be found on GitHub at http://github.com/dafny-lang/miniF2F .

Summary

  • The paper presents the miniF2F-Dafny benchmark, recasting theorem proving into auto-active verification by combining SMT-based automation with high-level LLM guidance.
  • It demonstrates a 35% improvement over bare SMT automation, with top models achieving up to a 55.7% pass rate on challenging olympiad-level problems.
  • The study highlights the potential of integrating LLM reasoning with automated verification to simplify formal proofs and lower barriers in mathematical formalization.

LLM-Guided Mathematical Theorem Proving via Auto-Active Verification: The miniF2F-Dafny Benchmark

Introduction and Context

"MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification" (2512.10187) introduces an automation-oriented adaptation of the established miniF2F mathematical reasoning benchmark, translating it from interactive theorem provers (ITPs) to the auto-active verifier Dafny. The motivation centers on exploring whether auto-active paradigms—where SMT solvers mechanize significant proof search—enable modern LLMs to make nontrivial contributions, particularly when the requirement is shifted from low-level, stepwise proof scripting to high-level mathematical insight and strategic hint generation.

Prior miniF2F instances for Lean, Isabelle, and HOL Light necessitate manual proof scripts for each logical step, either from humans or highly specialized AI agents. This work recasts miniF2F in Dafny to leverage SMT-backed automation, investigating the resultant division of labor between LLM-generated proof hints and automated logical reasoning, and establishing new baselines for automated mathematical formalization.

Benchmark Design and Technical Approach

The miniF2F-Dafny benchmark comprises 488 mathematical problems (test and validation sets of 244 problems each), translated systematically from Lean to Dafny’s input language. Each problem is a Dafny lemma with full signature, precisely specified preconditions (requires clauses), and postconditions (ensures clauses), but an empty proof body. The translation preserves mathematical diversity over algebra, combinatorics, number theory, inequalities, and analysis, reflecting the source benchmark’s composition (AIME, AMC, IMO, and undergraduate material).

To support SMT-automated verification, two foundational files are defined: definitions.dfy (axiomatizing integers, rationals, reals, complex numbers, and standard operations) and library.dfy (108 axiomatized lemmas spanning properties of exponentials, polynomials, trigonometric identities, complex analysis, number theory, and fundamental inequalities). The axiomatization is intentionally minimal, avoiding large-scale library development and focusing evaluation on proof synthesis and the interaction between SMT automation and LLM-generated guidance.

The evaluation pipeline prioritizes stringent adherence to original specifications—prohibiting unsound additions such as weakened postconditions, omission or modification of preconditions, or circumvention via assume statements or {:axiom} exploitation in solutions.

Baseline Results: Automation and LLM-Steered Proof Synthesis

SMT-Backed Automation

Running Dafny’s verifier (version 4.11.0, Z3 backend, 5 attempts, 30s timeout), 40.6% of the test set and 44.7% of the validation set (99/244 and 109/244 problems, respectively) are verified with empty proof bodies. This demonstrates that for a substantial fraction of olympiad-level mathematics, generic SMT-based automation—supported solely by minimal axiomatic infrastructure—can discharge all verification obligations with no human or AI input for proof construction. Notably, problems requiring multi-step, intricate proofs in Lean can collapse to the trivial case in Dafny.

LLM-Guided Proof Hints

On the remaining problems, 12 off-the-shelf LLMs are evaluated for their ability to fill the proof body with hints, assertions, intermediate lemmas, or algebraic calculations such that Dafny’s automation verifies the result. Each problem receives up to 4 generations per model, including up to 3 rounds of error-correction feedback based on verification diagnostics—a simulated iterative proof search process.

The top-performing model (Claude Sonnet 4.5) achieves a pass@4 of 55.7% on the test set—a 35% improvement over bare SMT automation. The next-best models are Claude Sonnet 3.7 (55.2%) and Qwen 3 235B Mixture-of-Experts (54.3%), with typical models clustering in the 43–50% range. Leaderboard results indicate that highly capable models, even when not fine-tuned on Dafny/theorem-proving data, can provide substantive mathematical guidance that interfaces productively with automated verification. Contrasting models (Llama, DeepSeek, GPT-OSS) often falter due to poor Dafny idiom familiarity or inability to exploit verification-oriented constructs such as calc statements or ghost variables.

Examples exhibit concise, effective LLM-generated reasoning (e.g., parity arguments, algebraic manipulations triggering SMT discharge) as well as more sophisticated proof strategies (e.g., auxiliary lemma synthesis, structured decompositions such as sum-of-squares). These outputs frequently resemble “explanatory outlines” rather than strictly formal script enumerations, which match the automation-centric setting.

Error Analysis

Three dominant sources of failure are identified:

  • Verification Brittleness: Minor syntactic or structural divergences (e.g., assertion ordering, structure of proofs) can impede SMT automation, resulting in verification failures for semantically correct yet verifier-unfriendly scripts.
  • Insufficient Language-Specific Data: LLMs often conflate Dafny with other ITP syntaxes, produce semantically vacuous idioms, or misuse verification primitives due to low exposure in pretraining corpora.
  • Mathematical Scope Limitations: Problems requiring facts or lemmas outside the minimal library hit unavoidable obstacles, necessitating creative theory extension within the solution proof itself.

Relation to Prior Work

The work draws a clear distinction from the dominant orientation of prior theorem-proving benchmarks, which are grounded in interactive theorem provers and necessitate granular proof term construction. Recent advances (Seed-Prover [seed-prover], HILBERT [hilbert], Aristotle (Achim et al., 1 Oct 2025)) deliver >99% accuracy on Lean miniF2F via ensemble agentic frameworks, massive formal corpora pretraining, and elaborate search and verification constructs. By contrast, miniF2F-Dafny is the first to instantiate the benchmark in an auto-active verification language, shifting the focus from proof search to proof synthesis under SMT-backed automation, thereby redefining the interface between LLMs and formal reasoning engines.

The evaluation also contrasts with prior attempts at program synthesis in Dafny (Poesia et al., 2024, Loughridge et al., 2024), where success rates have historically lagged, and underscores the crucial impact of verification pipeline soundness on benchmark integrity.

Implications and Future Directions

The clear demonstration of division of labor—LLMs outlining high-level insight, SMT solvers handling low-level detail—suggests that auto-active paradigms can lower the barrier to practical AI-augmented formalization in mathematics. This has direct implications for mathematical knowledge management, code and proof co-verification, and the integration of AI guidance within design-by-contract workflows.

Near-term research avenues include:

  • Dafny-specific Pretraining and Fine-Tuning: To bridge idiomatic knowledge gaps, dedicated pretraining and reinforcement learning on verification artifacts will likely yield improvements.
  • Agentic Proof Search: Adapting agent-based orchestrators (as in Seed-Prover/HILBERT) to the auto-active context, potentially fusing dynamic lemma synthesis and auto-caching, may close the remaining gap with ITP performance.
  • Automated Lemma Induction and Retrieval: Leveraging strategies from neural proof retrieval (LEGO-Prover (Wang et al., 2023)) and lemma mining may enable cumulative learning over the benchmark.
  • Benchmark Library Expansion: Extending the axiomatized mathematical corpus would allow meaningful evaluation at higher mathematical sophistication and diminish bottlenecks from unprovable obligations.

Beyond these, the synergies between auto-active verification and interactive proving may drive architectural convergence, as evidenced by trends such as Lean’s integration of automation-focused grind tactics.

Conclusion

The miniF2F-Dafny benchmark operationalizes the first rigorous evaluation of pure mathematical reasoning under the auto-active paradigm, demonstrating a productive interface between LLMs and SMT-powered automation. The benchmark’s structure incentivizes concise, high-level mathematical guidance from LLMs, delegating routine derivational work to the verifier. This architecture highlights the practical potential for broader adoption of AI-assisted formalization in mathematics and verification, and sets a foundation for future research bridging the gap between interactive and automation-centric paradigms.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We found no open problems mentioned in this paper.

Collections

Sign up for free to add this paper to one or more collections.