A benchmark for vericoding: formally verified program synthesis (2509.22908v1)

Published 26 Sep 2025 in cs.SE, cs.LG, and cs.PL

Abstract: We present and test the largest benchmark for vericoding, LLM generation of formally verified code from formal specifications, in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress over the past year has improved performance on pure Dafny verification from 68% to 96%. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark

Summary

  • The paper introduces a benchmark for vericoding with 12,504 formal specifications across Dafny, Verus/Rust, and Lean.
  • The paper reports vericoding success rates of 82.2% for Dafny, 44.2% for Verus, and 26.8% for Lean, and documents rapid year-over-year improvement in LLM verification capabilities.
  • The paper employs iterative translation, ensemble methods, and verification feedback to optimize formally verified program synthesis.

A Benchmark for Vericoding: Formally Verified Program Synthesis

Introduction and Motivation

This paper introduces a comprehensive benchmark for "vericoding," defined as the synthesis of formally verified code from formal specifications, in contrast to "vibe coding," which generates potentially buggy code from natural language descriptions. The benchmark comprises 12,504 formal specifications across Dafny, Verus/Rust, and Lean, with 6,174 new, previously unseen problems. The motivation is to address the limitations of vibe coding, where correctness is only probabilistically assured via test cases, and to leverage recent advances in AI-assisted formal verification to automate the generation of bug-free software. The work is situated in the context of rapid progress in automated theorem proving, but notes the lack of large, diverse benchmarks for formal verification and vericoding.

Benchmark Construction

The benchmark targets both automated theorem provers (ATPs) such as Dafny and Verus, and interactive theorem provers (ITPs) such as Lean. Tasks are sourced from existing verification and vibe coding benchmarks (e.g., DafnyBench, APPS, HumanEval, FVAPPS, CLEVER, VERINA), as well as mathematical library documentation (e.g., Numpy, BigNum). Each task consists of a formal specification, context, and optionally documentation, with implementations and proofs removed to create vericoding challenges. Specifications are translated between languages using an LLM-based iterative translation and repair loop, with verification feedback guiding corrections. Quality control includes LLM-based and manual validation, near-duplicate detection via sentence transformers and FAISS, and metadata annotation for quality metrics.
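
As a concrete illustration of the near-duplicate check, the sketch below embeds specification texts with a sentence-transformer model and queries a FAISS inner-product index for highly similar pairs. The embedding model, similarity threshold, and function name are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of near-duplicate detection over specification texts,
# assuming the sentence-transformers and faiss-cpu packages are available.
# The model choice and 0.95 threshold are illustrative, not from the paper.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def find_near_duplicates(specs: list[str], threshold: float = 0.95) -> list[tuple[int, int]]:
    model = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model
    emb = model.encode(specs, normalize_embeddings=True)   # unit vectors: inner product == cosine
    emb = np.asarray(emb, dtype=np.float32)
    index = faiss.IndexFlatIP(emb.shape[1])                # exact inner-product index
    index.add(emb)
    scores, neighbors = index.search(emb, k=2)             # nearest neighbors (including self)
    pairs = []
    for i, (score_row, nbr_row) in enumerate(zip(scores, neighbors)):
        for score, j in zip(score_row, nbr_row):
            if j != i and i < j and score >= threshold:    # report each highly similar pair once
                pairs.append((i, int(j)))
    return pairs
```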

Vericoding Process and Experimental Setup

The vericoding pipeline presents annotated prompts to LLMs, requesting code and proof blocks for each task. Generated blocks are validated for cheating patterns (e.g., bypassing verification via assume(false) or sorry), inserted into task templates, and checked with language-specific proof checkers. Error messages are fed back to the LLM for iterative correction, with a fixed number of attempts per task. Prompts are tailored for each language, and ensemble methods ("model union") aggregate successes across models. Manual inspection quantifies the validity of outputs and identifies language-specific issues.
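
A minimal sketch of this generate-check-repair loop is shown below. The prompt handling, the llm_complete and run_verifier callables, the <BLOCKS> placeholder, and the cheating-pattern list are assumptions for illustration, not the authors' exact pipeline.

```python
# Hypothetical sketch of the per-task vericoding loop: the LLM proposes code and
# proof blocks, obvious cheating patterns are rejected, the completed program is
# sent to the language's proof checker, and error messages are fed back.
import re
from typing import Callable, Optional, Tuple

CHEAT_PATTERNS = [r"assume\s*\(\s*false\s*\)", r"\bsorry\b", r"\badmit\b"]  # assumed screening list

def vericode(
    template: str,                                    # task file with a <BLOCKS> placeholder (assumed format)
    prompt: str,
    llm_complete: Callable[[str], str],               # stand-in for the model API
    run_verifier: Callable[[str], Tuple[bool, str]],  # stand-in for the Dafny/Verus/Lean checker
    max_attempts: int = 5,
) -> Optional[str]:
    feedback = ""
    for _ in range(max_attempts):
        candidate = llm_complete(prompt + feedback)
        if any(re.search(p, candidate) for p in CHEAT_PATTERNS):
            feedback = "\n\nDo not bypass verification with assume(false), sorry, or admit."
            continue
        program = template.replace("<BLOCKS>", candidate)
        verified, errors = run_verifier(program)
        if verified:
            return program                            # formally verified solution
        feedback = "\n\nThe proof checker reported:\n" + errors
    return None                                       # unsolved within the attempt budget
```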

Results and Analysis

Vericoding success rates vary significantly by language: Dafny achieves 82.2%, Verus 44.2%, and Lean 26.8% (model union). The best-performing models are Claude-Opus-4.1 for Dafny and GPT-5 for Verus and Lean. The lower success rates for Verus are attributed to its complex type system and less mature LLM training data, while Lean's challenges stem from LLMs being primarily trained on mathematical theorem proving rather than code verification.
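
The "model union" figures follow from a simple aggregation rule: a task counts as solved if at least one model produces a verified solution for it. A minimal sketch, assuming results are stored as per-model sets of solved task IDs (a layout chosen here for illustration):

```python
# Hypothetical sketch of the "model union" ensemble metric.
def union_success_rate(solved_by_model: dict[str, set[str]], all_tasks: set[str]) -> float:
    solved = set().union(*solved_by_model.values()) if solved_by_model else set()
    return len(solved & all_tasks) / len(all_tasks)

# Example with made-up numbers (not the paper's data):
# union_success_rate({"model_a": {"t1", "t2"}, "model_b": {"t2", "t3"}},
#                    {"t1", "t2", "t3", "t4"})  # -> 0.75
```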

Notably, formal verification success on DafnyBench has increased from 68% to 96% in one year, indicating rapid LLM progress. Including natural language descriptions ("vibe" information) in prompts does not significantly improve vericoding performance. Ensemble methods suggest potential for mixture-of-experts strategies, where different models tackle subproblems in parallel.

Figure 1: Vericoding success as a function of task spec length (top), generated code length (middle), and spec ratio (bottom), sorted by size.

Analysis of task parameters reveals that spec length is a weak predictor of vericoding difficulty, while longer solution code correlates with lower success rates, consistent with increased error probability in larger implementations. Spec ratio trends mirror those of solution length.
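
The trends in Figure 1 can be reproduced with a simple binning analysis: sort tasks by a size measure, split them into equal-count bins, and compute the success rate per bin. The sketch below assumes "spec ratio" means specification length divided by generated code length; the paper's exact definition may differ.

```python
# Hypothetical sketch of the size-vs-success analysis behind Figure 1.
import numpy as np

def success_by_size(sizes: np.ndarray, solved: np.ndarray, n_bins: int = 10) -> np.ndarray:
    order = np.argsort(sizes)                           # sort tasks by the chosen size measure
    bins = np.array_split(order, n_bins)                # equal-count bins
    return np.array([solved[b].mean() for b in bins])   # success rate per bin

# Usage with synthetic data (illustrative only):
# spec_len = np.random.randint(50, 500, 1000)
# code_len = np.random.randint(20, 800, 1000)
# solved = (np.random.rand(1000) < 0.5).astype(float)
# success_by_size(spec_len / code_len, solved)
```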

Limitations and Failure Modes

The benchmark includes tasks with incomplete or weak specifications, which sometimes admit trivial or unintended solutions. LLMs occasionally exploit these weaknesses, as illustrated by cases where returning an empty list satisfies the proof checker due to underspecified postconditions. Manual inspection finds that, conditioned on vericoding success, approximately 9% of specs are too weak and 15% have poor translations, but these still constitute valid vericoding tasks.
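
To make this failure mode concrete, the Lean sketch below shows a hypothetical under-specified sorting task (not taken from the benchmark): the postcondition only requires the output to be sorted and never relates it to the input, so returning the empty list discharges the proof obligation.

```lean
-- Hypothetical weak specification (not from the benchmark): the postcondition
-- demands only that the output is sorted, omitting that it must be a
-- permutation of the input, so an empty list "solves" the task.

def SortedSpec (_input output : List Nat) : Prop :=
  output.Pairwise (· ≤ ·)       -- missing: output is a permutation of _input

def sortImpl (_xs : List Nat) : List Nat := []   -- trivial, input-ignoring "solution"

theorem sortImpl_meets_spec (xs : List Nat) : SortedSpec xs (sortImpl xs) := by
  unfold SortedSpec sortImpl
  exact List.Pairwise.nil        -- the empty list is vacuously sorted
```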

Implications and Future Directions

The results demonstrate that LLMs can synthesize formally verified code for a substantial fraction of tasks, especially in ATPs like Dafny. The rapid improvement in verification rates suggests that vericoding success will continue to rise as LLMs are trained on more diverse and challenging datasets. The benchmark provides a foundation for further research in automated formal verification, including:

  • Prompt optimization and advanced search strategies (e.g., tree search, RL) to improve LLM performance.
  • Collaborative LLM networks or MoE architectures to leverage model diversity.
  • Extension to more complex benchmarks (e.g., SWE-bench, BashBench) and real-world software systems.
  • Integration of formal verification into compilers for automatic bug detection and repair.

The work complements concurrent efforts such as cslib, which expands the set of Lean tasks for LLM-based solution synthesis.

Conclusion

This benchmark establishes a new standard for vericoding, enabling rigorous evaluation of LLMs in formally verified program synthesis. The strong numerical results, particularly in Dafny, and the rapid progress in LLM capabilities, underscore the potential for AI to automate the generation of bug-free software. As AI-generated code becomes ubiquitous, formal verification will be essential for ensuring correctness and safety. The benchmark, dataset, and experimental framework released by the authors will facilitate future advances in AI-driven formal methods and program synthesis.

Open Problems

We found no open problems mentioned in this paper.

Tweets

This paper has been mentioned in 3 tweets and received 84 likes.

HackerNews

  1. Formally Verified Code Benchmark (2 points, 0 comments)