MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution (2403.17927v2)
Abstract: In software development, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing code. LLMs have shown promise in code generation but struggle to resolve GitHub issues, particularly at the repository level. To overcome this challenge, we empirically study why LLMs fail to resolve GitHub issues and analyze the major factors. Motivated by the empirical findings, we propose a novel LLM-based Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution: Manager, Repository Custodian, Developer, and Quality Assurance Engineer. The framework leverages the collaboration of these agents during planning and coding to unlock the potential of LLMs for resolving GitHub issues. In experiments on the SWE-bench benchmark, we compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude 2. MAGIS resolves 13.94% of the GitHub issues, significantly outperforming the baselines; in particular, it achieves an eight-fold increase in the resolved ratio over the direct application of GPT-4, an advanced LLM.
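To make the architecture in the abstract concrete, the sketch below wires the four named roles into a locate → plan → code → review loop. Only the role names (Manager, Repository Custodian, Developer, Quality Assurance Engineer) come from the paper; the string-to-string `llm` callable, the prompt wording, the unified-diff output format, and the fixed revision budget are illustrative assumptions for this sketch, not MAGIS's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in for any chat-completion endpoint (GPT-3.5, GPT-4, Claude 2, ...).
# The plain string-to-string signature is an assumption of this sketch.
LLM = Callable[[str], str]

@dataclass
class Issue:
    title: str
    body: str

class RepositoryCustodian:
    """Narrows the repository down to the files relevant to the issue."""
    def __init__(self, llm: LLM, repo_files: List[str]):
        self.llm, self.repo_files = llm, repo_files

    def locate(self, issue: Issue) -> List[str]:
        prompt = (f"Issue: {issue.title}\n{issue.body}\n\nFiles:\n"
                  + "\n".join(self.repo_files)
                  + "\n\nReply with the files that likely need changes, one per line.")
        reply = self.llm(prompt)
        return [f.strip() for f in reply.splitlines() if f.strip() in self.repo_files]

class Manager:
    """Decomposes the issue into one concrete coding task per relevant file."""
    def __init__(self, llm: LLM):
        self.llm = llm

    def plan(self, issue: Issue, files: List[str]) -> List[str]:
        prompt = (f"Issue: {issue.title}\n{issue.body}\n"
                  f"Relevant files: {', '.join(files)}\n"
                  "Write one coding task per file, one per line.")
        return [t.strip() for t in self.llm(prompt).splitlines() if t.strip()]

class Developer:
    """Implements a single task as a patch."""
    def __init__(self, llm: LLM):
        self.llm = llm

    def implement(self, task: str) -> str:
        return self.llm(f"Task: {task}\nReturn a unified diff implementing it.")

class QAEngineer:
    """Reviews a patch against its task and votes APPROVE or REJECT."""
    def __init__(self, llm: LLM):
        self.llm = llm

    def approves(self, task: str, patch: str) -> bool:
        verdict = self.llm(f"Task: {task}\nPatch:\n{patch}\nAnswer APPROVE or REJECT.")
        return "APPROVE" in verdict.upper()

def resolve(issue: Issue, llm: LLM, repo_files: List[str],
            max_rounds: int = 3) -> List[str]:
    """Locate relevant files, plan tasks, then code-and-review each task."""
    custodian = RepositoryCustodian(llm, repo_files)
    manager, developer, qa = Manager(llm), Developer(llm), QAEngineer(llm)

    patches: List[str] = []
    for task in manager.plan(issue, custodian.locate(issue)):
        for _ in range(max_rounds):  # revise until QA approves or budget is spent
            patch = developer.implement(task)
            if qa.approves(task, patch):
                patches.append(patch)
                break
    return patches
```

A caller would supply any model wrapper matching the assumed signature, e.g. `resolve(Issue("Fix crash on empty input", "..."), my_llm, ["app.py", "utils.py"])`. The real framework's prompts and agent coordination are more elaborate than this single-pass loop; the sketch only fixes the division of labor the abstract describes.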
- Anthropic. Claude 2. https://www.anthropic.com/news/claude-2, 2023.
- Program synthesis with large language models. arXiv Preprint, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.
- Factors influencing code review processes in industry. In Thomas Zimmermann, Jane Cleland-Huang, and Zhendong Su, editors, Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, WA, USA, November 13-18, 2016, pages 85–96. ACM, 2016. doi: 10.1145/2950290.2950323. URL https://doi.org/10.1145/2950290.2950323.
- Got issues? Who cares about it? A large scale investigation of issue trackers from GitHub. In IEEE 24th International Symposium on Software Reliability Engineering, ISSRE 2013, Pasadena, CA, USA, November 4-7, 2013, pages 188–197. IEEE Computer Society, 2013. doi: 10.1109/ISSRE.2013.6698918. URL https://doi.org/10.1109/ISSRE.2013.6698918.
- Impact of developer reputation on code review outcomes in OSS projects: an empirical investigation. In Maurizio Morisio, Tore Dybå, and Marco Torchiano, editors, 2014 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM ’14, Torino, Italy, September 18-19, 2014, pages 33:1–33:10. ACM, 2014. doi: 10.1145/2652524.2652544. URL https://doi.org/10.1145/2652524.2652544.
- Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv Preprint, abs/2303.12712, 2023. doi: 10.48550/ARXIV.2303.12712. URL https://doi.org/10.48550/arXiv.2303.12712.
- ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv Preprint, abs/2308.07201, 2023. doi: 10.48550/ARXIV.2308.07201. URL https://doi.org/10.48550/arXiv.2308.07201.
- PTP: Boosting stability and performance of prompt tuning with perturbation-based regularizer. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 13512–13525. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.emnlp-main.833.
- Evaluating large language models trained on code. arXiv Preprint, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
- ClassEval: A manually-crafted benchmark for evaluating LLMs on class-level code generation, 2023.
- OpenLLaMA: An open reproduction of LLaMA, May 2023. URL https://github.com/openlm-research/open_llama.
- MetaGPT: Meta programming for a multi-agent collaborative framework, 2023.
- Large language models for software engineering: A systematic literature review. arXiv Preprint, abs/2308.10620, 2023. doi: 10.48550/ARXIV.2308.10620. URL https://doi.org/10.48550/arXiv.2308.10620.
- Practitioners’ expectations on automated code comment generation. In 44th IEEE/ACM International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022, pages 1693–1705. ACM, 2022. doi: 10.1145/3510003.3510152. URL https://doi.org/10.1145/3510003.3510152.
- SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=VTF8yNQM66.
- Thomas Johnsson. Attribute grammars as a functional programming paradigm. In Gilles Kahn, editor, Functional Programming Languages and Computer Architecture, Portland, Oregon, USA, September 14-16, 1987, Proceedings, volume 274 of Lecture Notes in Computer Science, pages 154–173. Springer, 1987. doi: 10.1007/3-540-18317-5_10. URL https://doi.org/10.1007/3-540-18317-5_10.
- Investigating code review quality: Do people and participation matter? In Rainer Koschke, Jens Krinke, and Martin P. Robillard, editors, 2015 IEEE International Conference on Software Maintenance and Evolution, ICSME 2015, Bremen, Germany, September 29 - October 1, 2015, pages 111–120. IEEE Computer Society, 2015. doi: 10.1109/ICSM.2015.7332457. URL https://doi.org/10.1109/ICSM.2015.7332457.
- Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023a. URL http://papers.nips.cc/paper_files/paper/2023/hash/43e9d647ccd3e4b7b5baab53f0368686-Abstract-Conference.html.
- Lost in the middle: How language models use long contexts. arXiv Preprint, abs/2307.03172, 2023b. doi: 10.48550/ARXIV.2307.03172. URL https://doi.org/10.48550/arXiv.2307.03172.
- The impact of code review coverage and code review participation on software quality: a case study of the qt, vtk, and ITK projects. In Premkumar T. Devanbu, Sung Kim, and Martin Pinzger, editors, 11th Working Conference on Mining Software Repositories, MSR 2014, Proceedings, May 31 - June 1, 2014, Hyderabad, India, pages 192–201. ACM, 2014. doi: 10.1145/2597073.2597076. URL https://doi.org/10.1145/2597073.2597076.
- Developer-intent driven code comment generation. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, pages 768–780. IEEE, 2023. doi: 10.1109/ICSE48619.2023.00073. URL https://doi.org/10.1109/ICSE48619.2023.00073.
- OpenAI. GPT-4 technical report, 2023. URL https://doi.org/10.48550/arXiv.2303.08774.
- OpenAI. GPT-3.5 Turbo fine-tuning and API updates. https://openai.com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates, 2023a.
- OpenAI. GPT-4. https://openai.com/research/gpt-4, 2023b.
- Communicative agents for software development. arXiv Preprint, 2023.
- Improving language understanding by generative pre-training, 2018.
- Okapi at TREC-3. In Donna K. Harman, editor, Proceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994, volume 500-225 of NIST Special Publication, pages 109–126. National Institute of Standards and Technology (NIST), 1994. URL http://trec.nist.gov/pubs/trec3/papers/city.ps.gz.
- Jessica Shieh. Best practices for prompt engineering with OpenAI API. OpenAI, February 2023. URL https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api.
- A survey of neural code intelligence: Paradigms, advances and beyond, 2024.
- Multi-agent collaboration: Harnessing the power of intelligent LLM agents. arXiv Preprint, abs/2306.03314, 2023. doi: 10.48550/ARXIV.2306.03314. URL https://doi.org/10.48550/arXiv.2306.03314.
- KADEL: Knowledge-aware denoising learning for commit message generation. ACM Trans. Softw. Eng. Methodol., January 2024. ISSN 1049-331X. doi: 10.1145/3643675. URL https://doi.org/10.1145/3643675.
- LLaMA-MoE Team. LLaMA-MoE: Building mixture-of-experts from LLaMA with continual pre-training, December 2023. URL https://github.com/pjlab-sys4nlp/llama-moe.
- The Cognition Team. SWE-bench technical report, 2024. URL https://www.cognition-labs.com/post/swe-bench-technical-report.
- LLaMA: Open and efficient foundation language models. arXiv Preprint, abs/2302.13971, 2023.
- AutoDev: Automated AI-driven development, 2024.
- AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. arXiv Preprint, abs/2308.08155, 2023. doi: 10.48550/ARXIV.2308.08155. URL https://doi.org/10.48550/arXiv.2308.08155.
- A survey of large language models. arXiv Preprint, abs/2303.18223, 2023. doi: 10.48550/ARXIV.2303.18223. URL https://doi.org/10.48550/arXiv.2303.18223.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023a. URL http://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html.
- Towards an understanding of large language models in software engineering tasks. arXiv Preprint, abs/2308.11396, 2023b. doi: 10.48550/ARXIV.2308.11396. URL https://doi.org/10.48550/arXiv.2308.11396.
- Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study. IEEE Trans. Software Eng., 47(2):243–260, 2021. doi: 10.1109/TSE.2018.2887384. URL https://doi.org/10.1109/TSE.2018.2887384.
- Thread of thought unraveling chaotic contexts. arXiv Preprint, abs/2311.08734, 2023. doi: 10.48550/ARXIV.2311.08734. URL https://doi.org/10.48550/arXiv.2311.08734.