- The paper introduces DafnyBench, the largest benchmark to date for formal software verification, with 782 Dafny programs totaling roughly 53,000 lines of code, enabling rigorous training and evaluation of LLMs.
- It defines a task in which models must regenerate the verification hints (assert and invariant annotations) needed for a program to pass the Dafny verifier, while leaving the original pre- and postconditions intact and avoiding bypass constructs.
- Claude 3 Opus and GPT-4 Turbo achieve success rates of about 68% and 59.8%, respectively, underscoring both progress and remaining challenges in automated formal software verification.
The paper "DafnyBench: A Benchmark for Formal Software Verification" introduces DafnyBench, the largest benchmark to date for training and evaluating machine learning systems aimed at formal software verification. This benchmark is essential for bridging the gap between rapid software development facilitated by LLMs and the need for ensuring that automatically generated or assisted code meets rigorous specifications. DafnyBench stands out in its scale, comprehensiveness, and the implications it bears for the future of AI-aided code verification processes.
Dataset Description
DafnyBench consists of 782 Dafny programs totaling approximately 53,000 lines of code, a significant expansion over previous benchmarks such as Clover and dafny-synthesis, which included only 66 and 153 programs, respectively. The dataset draws programs from GitHub and includes others translated to Dafny from the MBPP benchmark, covering a diverse array of programming constructs and verification challenges. DafnyBench's programs are also more complex than Clover's, often containing multiple methods, functions, and lemmas per program, which better reflects real-world code.
Task and Evaluation Metric
The primary task proposed by the benchmark is to regenerate the verification hints (assert and invariant statements) needed for a program to pass the Dafny verifier. This challenging task assesses a model's ability to understand and accurately generate formal verification constructs. The evaluation metric is straightforward yet rigorous: a program counts as successfully verified only if the generated hints make it pass all verification checks without modifying the provided preconditions (`requires` clauses) and postconditions (`ensures` clauses), and without resorting to bypass constructs such as `{:verify false}` or `assume false`.
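To make the success criterion concrete, below is a minimal sketch (not the paper's actual harness) of how such a check could be implemented. A toy Dafny method is embedded as a string, with the `invariant` lines playing the role of the hints a model must regenerate and the `ensures` clause standing in for the untouchable specification; the helper names (`spec_lines`, `passes_benchmark`), the exact cheat-string matching, and the `dafny verify` invocation are assumptions made for illustration.

```python
import subprocess
import tempfile

# A toy Dafny program in the spirit of a DafnyBench task. The `invariant`
# lines are the hints a model would have to regenerate; the `ensures` clause
# is the specification that must remain untouched.
CANDIDATE = """
method CountZeros(a: array<int>) returns (count: nat)
  ensures count <= a.Length
{
  count := 0;
  var i := 0;
  while i < a.Length
    invariant 0 <= i <= a.Length
    invariant count <= i
  {
    if a[i] == 0 { count := count + 1; }
    i := i + 1;
  }
}
"""

# Constructs that would trivially bypass verification and are therefore disallowed.
CHEATS = ("{:verify false}", "assume false")

def spec_lines(program: str) -> list[str]:
    """Collect requires/ensures clauses so we can check they were not edited."""
    return [line.strip() for line in program.splitlines()
            if line.strip().startswith(("requires", "ensures"))]

def passes_benchmark(original: str, candidate: str) -> bool:
    """Approximate success criterion: no cheats, same spec, and the verifier accepts."""
    if any(cheat in candidate for cheat in CHEATS):
        return False
    if spec_lines(candidate) != spec_lines(original):
        return False
    with tempfile.NamedTemporaryFile("w", suffix=".dfy", delete=False) as f:
        f.write(candidate)
        path = f.name
    # Assumes a Dafny CLI on PATH; exit code 0 means all proof obligations passed.
    result = subprocess.run(["dafny", "verify", path], capture_output=True, text=True)
    return result.returncode == 0

# In this sketch the candidate is checked against itself, i.e. with its spec unchanged.
print(passes_benchmark(CANDIDATE, CANDIDATE))
```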
Results and Analysis
Five LLMs were tested on DafnyBench: GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo, Claude 3 Opus, and CodeLlama-7b-Instruct-hf. Claude 3 Opus achieved the highest success rate at approximately 68%, followed by GPT-4 Turbo at 59.8%, indicating substantial progress while leaving considerable room for improvement. Notably, the best-performing model's success rate improved markedly when failed attempts were retried with the verifier's error messages included in the prompt, although the gains plateaued after several attempts, suggesting diminishing returns from error feedback.
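The retry mechanism can be pictured as a simple loop that appends the verifier's error output to the next prompt. The sketch below is an illustration only, not the paper's pipeline: `ask_llm` and `run_dafny_verifier` are hypothetical helpers, the prompt wording is invented, and the attempt limit is an arbitrary placeholder.

```python
def verify_with_feedback(original_program: str, stripped_program: str,
                         max_attempts: int = 5) -> bool:
    """Retry loop sketch: each failed attempt is retried with the verifier's
    error messages appended to the prompt. `ask_llm` wraps whichever chat model
    is queried; `run_dafny_verifier` returns (verified, error_log) for a
    candidate program. Both are hypothetical helpers, not the paper's code."""
    prompt = ("Fill in the assert statements and loop invariants so that this "
              "Dafny program verifies, without changing any requires/ensures "
              "clauses:\n\n" + stripped_program)
    for _ in range(max_attempts):
        candidate = ask_llm(prompt)
        verified, error_log = run_dafny_verifier(candidate)
        # Only count success if the spec is untouched and no bypass constructs
        # appear, e.g. via the passes_benchmark() check from the earlier sketch.
        if verified and passes_benchmark(original_program, candidate):
            return True
        # Feed the failing attempt and the verifier's complaints back to the model.
        prompt = ("The following attempt did not verify:\n\n" + candidate +
                  "\n\nVerifier output:\n" + error_log +
                  "\n\nReturn a corrected version of the complete program.")
    return False
```

Under this framing, the plateau the authors report would correspond to the harder programs remaining unsolved even after the loop exhausts its attempts, despite the model seeing the verifier's feedback each time.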
The models also struggled more with longer programs and those requiring more hint text, underscoring the complexity and dependency-management challenges that come with larger codebases and echoing the difficulties of real-world software development.
Implications and Future Work
DafnyBench's scale and detailed evaluations offer several implications for both practical and theoretical advancements in AI for formal verification:
- Practical Implications:
  - Automation Potential: The benchmark opens pathways for LLMs to automate significant portions of the formal verification process, potentially reducing the cost and time required for ensuring software correctness.
  - Tool Development: The insights from DafnyBench can inform the development of more sophisticated verification tools that can integrate seamlessly into existing development workflows, automating hint generation and error correction.
- Theoretical Implications:
  - Benchmark Expansion: The need for even larger benchmarks is evident, with future expansions incorporating more varied programming paradigms and complexities to ensure comprehensive model training and evaluation.
  - Model Improvement: There is potential for developing new algorithms and fine-tuning methods focused on formal verification tasks, enabling models to handle dependencies and logical reasoning more effectively.
Conclusion
DafnyBench represents a significant step forward in formal software verification, providing a robust and extensive dataset for training and evaluating LLMs. The results highlight current limitations and chart a clear path for future research, emphasizing the potential for AI models to transform software verification. As formal verification techniques and LLM capabilities continue to evolve, benchmarks like DafnyBench will be instrumental in driving the adoption of automated, reliable, and efficient verification methods in software development. The paper lays a solid foundation for further work toward the ultimate goal of fully automated formal verification integrated directly into software compilers.