An Empirical Evaluation of Pre-trained Large Language Models for Repairing Declarative Formal Specifications (2404.11050v1)
Abstract: Automatic Program Repair (APR) has garnered significant attention as a practical research domain focused on automatically fixing bugs in programs. While existing APR techniques primarily target imperative programming languages like C and Java, there is a growing need for effective solutions applicable to declarative software specification languages. This paper presents a systematic investigation into the capacity of pre-trained large language models (LLMs) to repair specifications written in Alloy, a declarative formal language for software specification. We propose a novel repair pipeline that integrates a dual-agent LLM framework, comprising a Repair Agent and a Prompt Agent. Through extensive empirical evaluation, we compare the effectiveness of LLM-based repair with state-of-the-art Alloy APR techniques on a comprehensive set of benchmarks. Our study reveals that LLMs, particularly GPT-4 variants, outperform existing techniques in terms of repair efficacy, albeit with a marginal increase in runtime and token usage. This research contributes to advancing the field of automatic repair for declarative specifications and highlights the promising potential of LLMs in this domain.
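The abstract names a dual-agent pipeline (a Repair Agent and a Prompt Agent) without describing how the two interact. The sketch below shows one plausible shape for such a loop and should not be read as the authors' implementation. Everything in it is an assumption made for illustration: the function names (`repair_loop`, `toy_repair_agent`, `toy_prompt_agent`, `toy_check`), the `CheckResult` type, the feedback-driven control flow, and the Alloy-like strings, which are never parsed. The agents and the checker are deterministic stand-ins; a real setup would call an LLM and an Alloy analysis tool in their place.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    """Outcome of validating a candidate specification (hypothetical type)."""
    ok: bool
    feedback: str


def repair_loop(
    spec: str,
    repair_agent: Callable[[str], str],
    prompt_agent: Callable[[str, str], str],
    check: Callable[[str], CheckResult],
    max_rounds: int = 3,
) -> Optional[str]:
    """Alternate between proposing a fix and refining the prompt.

    Assumed control flow: the repair agent proposes a candidate, the checker
    validates it, and on failure the prompt agent turns the checker's
    feedback into the next prompt. Returns None if no candidate passes.
    """
    prompt = "Repair this specification:\n" + spec
    for _ in range(max_rounds):
        candidate = repair_agent(prompt)
        result = check(candidate)
        if result.ok:
            return candidate
        prompt = prompt_agent(candidate, result.feedback)
    return None


# Toy stand-ins so the sketch runs without an LLM or an analyzer.
BUGGY = "fact NoSelfLoop { some n: Node | n not in n.next }"
FIXED = "fact NoSelfLoop { no n: Node | n in n.next }"


def toy_check(candidate: str) -> CheckResult:
    # Stand-in for a real analysis: accept only the known-good string.
    if candidate == FIXED:
        return CheckResult(True, "")
    return CheckResult(False, "counterexample: a node points to itself")


def toy_repair_agent(prompt: str) -> str:
    # Stand-in for an LLM call: only "fixes" the spec once feedback is present.
    return FIXED if "counterexample" in prompt else BUGGY


def toy_prompt_agent(candidate: str, feedback: str) -> str:
    # Stand-in for an LLM call: folds the checker feedback into a new prompt.
    return f"The previous attempt failed ({feedback}). Revise:\n{candidate}"


if __name__ == "__main__":
    print(repair_loop(BUGGY, toy_repair_agent, toy_prompt_agent, toy_check))
```

With these stand-ins, the first round returns the unchanged specification, the checker rejects it, and the second round succeeds because the refined prompt carries the checker's feedback. Passing the agents and the checker as plain callables keeps the loop independent of any particular LLM client or analysis tool.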
Authors: Mohannad Alhanahnah, Md Rashedul Hasan, Hamid Bagheri