FVEval: Understanding Language Model Capabilities in Formal Verification of Digital Hardware (2410.23299v1)

Published 15 Oct 2024 in cs.AR and cs.AI

Abstract: The remarkable reasoning and code generation capabilities of LLMs have spurred significant interest in applying them to automate tasks in digital chip design. In particular, recent work has investigated early ideas of applying these models to formal verification (FV), an approach to verifying hardware implementations that can provide strong guarantees of correctness but demands significant human effort. While the value of LLM-driven automation is evident, our understanding of model performance has been hindered by the lack of holistic evaluation. In response, we present FVEval, the first comprehensive benchmark and evaluation framework for characterizing LLM performance on tasks pertaining to FV. The benchmark consists of three sub-tasks that measure LLM capabilities at different levels: from generating SystemVerilog assertions (SVAs) from natural language descriptions to reasoning about the design RTL and suggesting assertions directly, without additional human input. As test instances, we present both collections of expert-written verification collateral and methodologies for scalably generating synthetic examples aligned with industrial FV workflows. A wide range of existing LLMs, both proprietary and open-source, are evaluated against FVEval, based on which we investigate where today's LLMs stand and how we might further enable their application toward improving productivity in digital FV. Our benchmark and evaluation code are available at https://github.com/NVlabs/FVEval.
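To make the first sub-task concrete, below is a minimal, hypothetical Python sketch of an NL-to-SVA evaluation loop of the kind the abstract describes: an LLM is prompted with a natural-language property and the relevant signal names, and the harness extracts a SystemVerilog assertion from the response. Everything here is an assumption for illustration; `query_llm`, the prompt template, and the helper names are stand-ins and are not part of the actual FVEval API.

```python
# Hypothetical sketch of an NL-to-SVA generation step, as described in the
# FVEval abstract. Not the paper's implementation; names are illustrative.
import re


def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., a chat-completion client)."""
    raise NotImplementedError


PROMPT_TEMPLATE = """You are a formal verification engineer.
Signals: {signals}
Property (natural language): {nl_property}
Write a single SystemVerilog assertion (SVA) implementing this property."""


def generate_sva(nl_property: str, signals: list[str]) -> str:
    """Prompt the model with the signal list and natural-language property."""
    prompt = PROMPT_TEMPLATE.format(
        signals=", ".join(signals), nl_property=nl_property
    )
    return query_llm(prompt)


def extract_assertion(response: str) -> str | None:
    """Pull the first `assert property (...);` statement from a model response."""
    match = re.search(r"assert\s+property\s*\(.*?\)\s*;", response, re.DOTALL)
    return match.group(0) if match else None


# Example instance: the description "grant must follow request on the next
# cycle" might map to an assertion such as:
#   assert property (@(posedge clk) req |-> ##1 gnt);
```

A real harness would then score the extracted assertion against a reference, for example by running it through a SystemVerilog compiler for syntax checks or comparing it to an expert-written golden assertion; the abstract does not specify the scoring mechanism, so that step is omitted here.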

