$\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks (2403.18327v2)
Abstract: This paper presents $\forall$uto$\exists$val, a new approach for scaling LLM assessment in translating formal syntax -- such as first-order logic and regular expressions -- to natural language (interpretation) or vice versa (compilation), thereby facilitating the use of LLMs in applications such as generating and explaining logic and control flow for programs. Existing approaches for LLM assessment in these areas require labor-intensive ground-truth creation, and the public availability of such data undermines the separation of training and test sets. Furthermore, such datasets typically contain relatively few hand-coded test cases over which LLM accuracy is determined, making them inadequate for establishing the safety or correctness of LLM-generated outputs. We introduce a new approach that uses context-free grammars (CFGs) to generate out-of-distribution datasets on the fly and performs closed-loop testing of LLM capabilities, using formal verifiers to guarantee the correctness of LLM outputs without any human intervention. We release our dataset and benchmark as open-source code at \url{https://github.com/AAIR-lab/auto-LLM-assessment}. We also assess several state-of-the-art (SOTA) closed and open-source LLMs to showcase the feasibility and scalability of this paradigm. Our experiments reveal that SOTA LLMs are unable to solve the formal translation task adequately.
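The closed-loop idea described above can be illustrated with a minimal sketch: sample formulas from a small CFG, then use an exhaustive truth-table check as a stand-in for the formal verifier to accept or reject a candidate translation. This is a hypothetical toy example, not the paper's actual grammar or verifier; `gen_formula` and `equivalent` are illustrative names, and a real pipeline would use a solver such as Z3 or Prover9.

```python
import itertools
import random

# Toy propositional grammar over three variables:
#   F -> var | (not F) | (F and F) | (F or F)
VARS = ["p", "q", "r"]

def gen_formula(depth=3, rng=random):
    """Sample one formula string from the tiny CFG above."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(VARS)
    if rng.random() < 0.3:
        return f"(not {gen_formula(depth - 1, rng)})"
    op = rng.choice(["and", "or"])
    return f"({gen_formula(depth - 1, rng)} {op} {gen_formula(depth - 1, rng)})"

def equivalent(f1, f2):
    """Verifier stand-in: check equivalence over all 2^3 assignments.

    Sound and complete for this finite grammar; eval is safe here because
    the strings come only from our own CFG, never from untrusted input.
    """
    for values in itertools.product([False, True], repeat=len(VARS)):
        env = dict(zip(VARS, values))
        if eval(f1, {}, env) != eval(f2, {}, env):
            return False
    return True

# Closed-loop check: a correct round-trip translation must be equivalent
# to the generated formula, while a wrong one is rejected automatically.
f = gen_formula()
assert equivalent(f, f)
assert not equivalent("p", "(not p)")
```

Because the dataset is generated on the fly from the grammar, no hand-coded ground truth is needed, and the equivalence check closes the loop without human intervention.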
Authors: Rushang Karia, Daksh Dobhal, Daniel Bramblett, Pulkit Verma, Siddharth Srivastava