
$\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks (2403.18327v2)

Published 27 Mar 2024 in cs.CL and cs.AI

Abstract: This paper presents $\forall$uto$\exists$val, a new approach for scaling LLM assessment in translating formal syntax -- such as first-order logic and regular expressions -- to natural language (interpretation) or vice versa (compilation), thereby facilitating their use in applications such as generating and explaining logic and control flow for programs. Existing approaches for LLM assessment in these areas require labor-intensive ground-truth creation, whose availability undermines the separation of training and test sets. Furthermore, such datasets typically include relatively few hand-coded test cases over which LLM accuracy is determined, making them inadequate for determining the safety or correctness of generated outputs. We introduce a new approach that uses context-free grammars (CFGs) to generate out-of-distribution datasets on the fly and performs closed-loop testing of LLM capabilities using formal verifiers to guarantee the correctness of LLM outputs without any human intervention. We release our dataset and benchmark as open-source code at \url{https://github.com/AAIR-lab/auto-LLM-assessment}. We also assess several SOTA closed and open-source LLMs to showcase the feasibility and scalability of this paradigm. Our experiments reveal that SOTA LLMs are unable to solve the formal translation task adequately.
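The closed-loop idea in the abstract can be sketched in a few lines: a context-free grammar produces fresh formal-syntax inputs on demand, and a verifier checks whether an LLM's round-trip translation is still equivalent to the original, with no human-written ground truth. The sketch below is illustrative only, assuming a toy propositional grammar and a brute-force truth-table check in place of the formal verifiers the paper uses (e.g., Z3 or Prover9); all names in it are hypothetical, not the paper's actual implementation.

```python
import itertools
import random

# Toy CFG for propositional formulas (illustrative; the paper's grammars
# cover first-order logic, regular expressions, etc.).
GRAMMAR = {
    "F": [["(", "F", " & ", "F", ")"],   # conjunction
          ["(", "F", " | ", "F", ")"],   # disjunction
          ["~", "F"],                    # negation
          ["V"]],                        # a variable
    "V": [["p"], ["q"], ["r"]],
}

def generate(symbol="F", depth=0, max_depth=4):
    """Randomly expand a nonterminal into a propositional formula string."""
    if symbol == "F" and depth >= max_depth:
        symbol = "V"                     # force termination at the depth cap
    if symbol not in GRAMMAR:
        return symbol                    # terminal token
    rule = random.choice(GRAMMAR[symbol])
    return "".join(generate(tok, depth + 1, max_depth) for tok in rule)

def equivalent(f1, f2, variables=("p", "q", "r")):
    """Stand-in verifier: compare truth tables of f1 and f2."""
    def to_python(f):
        return f.replace("~", " not ").replace("&", "and").replace("|", "or")
    for values in itertools.product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if eval(to_python(f1), {}, env) != eval(to_python(f2), {}, env):
            return False
    return True

# Closed-loop assessment: the CFG yields a fresh ground-truth formula, an
# LLM would interpret it into natural language and compile it back, and the
# verifier scores the round trip. The "LLM" here is stubbed as identity.
formula = generate()
llm_round_trip = lambda f: f             # placeholder for interpret + compile
print(formula, equivalent(formula, llm_round_trip(formula)))
```

Because the generator samples a new formula each run, the test set is produced on the fly rather than hand-coded, which is what lets the assessment scale without leaking into training data.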

Authors (5)
  1. Rushang Karia
  2. Daksh Dobhal
  3. Daniel Bramblett
  4. Pulkit Verma
  5. Siddharth Srivastava
Citations (1)