
MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

Published Mar 26, 2024 in cs.SE and cs.AI


In software evolution, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing functionalities. Large Language Models (LLMs) have shown promise in code generation and understanding but struggle with code changes, particularly at the repository level. To overcome these challenges, we empirically study why LLMs mostly fail to resolve GitHub issues and analyze the main impact factors. Motivated by the empirical findings, we propose MAGIS, a novel LLM-based Multi-Agent framework for GitHub Issue reSolution, consisting of four kinds of agents customized for software evolution: Manager, Repository Custodian, Developer, and Quality Assurance Engineer. This framework leverages the collaboration of these agents in the planning and coding process to unlock the potential of LLMs to resolve GitHub issues. In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS resolves 13.94% of GitHub issues, significantly outperforming the baselines. Specifically, MAGIS achieves an eight-fold increase in resolved ratio over the direct application of GPT-4, the base LLM of our method. We also analyze factors that improve GitHub issue resolution rates, such as line location and task allocation.
Figure: Overview of the MAGIS framework, showing its structure and operational workflow.


  • MAGIS introduces an LLM-based Multi-Agent framework designed to improve GitHub issue resolution, featuring specialized roles for collaboration.

  • Empirical analysis shows MAGIS outperforms existing LLMs, such as GPT-4, by a significant margin in the resolved ratio of GitHub issues.

  • The framework mimics human workflows to optimize LLM capabilities, enhancing efficiency in software development and issue resolution.

  • MAGIS's success in handling complex software modifications suggests a promising future for AI in software evolution and development processes.

Introduction to MAGIS

Managing GitHub issues is a significant aspect of software evolution, demanding sophisticated solutions that account for both the introduction of new functionalities and the maintenance of existing ones. Given the prowess of LLMs in code generation and comprehension, their application to software development processes, especially repository-level tasks such as GitHub issue resolution, merits exploration.

In response, we propose MAGIS, an LLM-based Multi-Agent framework for GitHub Issue reSolution. The framework introduces a collaborative mechanism among specialized agents—Manager, Repository Custodian, Developer, and Quality Assurance Engineer—each playing a critical role in helping LLMs overcome repository-level coding challenges. Our framework not only demonstrates a significant improvement over existing LLMs in resolving GitHub issues but also lays the groundwork for future advances in AI-assisted software evolution.

Empirical Analysis

Our examination reveals two primary factors affecting the performance of LLMs in issue resolution: the accuracy of line location for code changes and the overall complexity of these changes. The findings underscore the pivotal role of precisely identifying code modification locations and managing the complexity of alterations, particularly in settings without an oracle (i.e., when the files to be modified are not provided).
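To make the first factor concrete, the sketch below compares where a generated patch edits code against where the reference patch does. It assumes patches are unified diffs; the `tolerance` threshold and both function names are illustrative choices for this sketch, not definitions from the paper.

```python
import re

def hunk_starts(diff_text):
    """Extract (file, start_line) pairs for each hunk in a unified diff."""
    locations, current_file = [], None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            current_file = line[4:].removeprefix("b/")
        elif line.startswith("@@"):
            # Hunk header looks like: @@ -old_start,old_len +new_start,new_len @@
            m = re.match(r"@@ -(\d+)", line)
            if m and current_file:
                locations.append((current_file, int(m.group(1))))
    return locations

def location_overlap(generated, reference, tolerance=5):
    """Fraction of reference hunks that a generated hunk hits in the
    same file within `tolerance` lines (illustrative metric)."""
    if not reference:
        return 0.0
    hits = sum(
        any(gf == rf and abs(gs - rs) <= tolerance for gf, gs in generated)
        for rf, rs in reference
    )
    return hits / len(reference)
```

For example, a generated hunk at line 12 of `x.py` counts as locating a reference hunk at line 10 of the same file, while an edit to a different file does not.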

The efficacy of our framework in these contexts is evident from a comprehensive comparison against popular LLMs on SWE-bench. The experiments show an eight-fold improvement in the resolved ratio over the base LLM, GPT-4, establishing a robust groundwork for further exploration.

MAGIS Framework: Roles and Collaborative Process

MAGIS introduces an innovative approach, deriving inspiration from traditional human workflows yet distinctly tailored to optimize LLM capabilities. Each agent within our framework performs specific roles—ranging from identifying pertinent files in repositories to ensuring the quality of code changes—which collectively streamline the issue resolution process. This structured collaboration not only enhances the efficiency of LLM applications but also aligns with the established practices of software development, thus bridging the gap between AI potentials and practical requirements.
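The division of labor above can be sketched as a minimal pipeline. Every class, function, and heuristic here is a hypothetical illustration of the four roles, not MAGIS's actual implementation (which drives each role with LLM calls elided below).

```python
from dataclasses import dataclass

@dataclass
class Issue:
    title: str
    body: str

def custodian_locate(issue, repo_files):
    """Repository Custodian: shortlist files likely relevant to the issue
    (a naive keyword match stands in for the real retrieval step)."""
    keywords = set(issue.body.lower().split())
    return [f for f in repo_files if any(k in f.lower() for k in keywords)]

def manager_plan(issue, files):
    """Manager: decompose the issue into one task per candidate file."""
    return [{"file": f, "goal": issue.title} for f in files]

def developer_patch(task):
    """Developer: draft a code change for one task (LLM call elided)."""
    return {"file": task["file"], "diff": f"# fix for {task['goal']}"}

def qa_review(patch):
    """QA Engineer: accept or reject a drafted change (review elided)."""
    return bool(patch["diff"])

def resolve(issue, repo_files):
    """Run the four roles in sequence: locate, plan, code, review."""
    files = custodian_locate(issue, repo_files)
    patches = [developer_patch(t) for t in manager_plan(issue, files)]
    return [p for p in patches if qa_review(p)]
```

The sequencing mirrors the workflow described above: the Custodian narrows the repository, the Manager plans, Developers produce changes, and QA gates what ships.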

Experimental Validation and Outcomes

The effectiveness of our framework is validated across several dimensions—overall issue resolution, file location recall, and the planning and coding processes. Our findings demonstrate that MAGIS considerably outperforms baseline LLMs in GitHub issue resolution. Notably, our approach exhibits a consistent ability to handle complex modifications, often producing viable solutions that, in certain instances, are more concise than their human-written counterparts.
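File location recall, one of the dimensions above, reduces to a set ratio: the share of files touched by the reference patch that the system also identified. A minimal sketch (the function name is illustrative):

```python
def file_recall(generated_files, reference_files):
    """Recall of reference files: |generated ∩ reference| / |reference|."""
    ref = set(reference_files)
    if not ref:
        return 0.0
    return len(ref & set(generated_files)) / len(ref)
```

For instance, identifying one of the two files a reference patch edits yields a recall of 0.5.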

Insights and Future Directions

The significant advancements heralded by MAGIS in utilizing LLMs for software evolution emphasize the potential of AI in navigating the complexities of software development. The framework not only showcases the capacity to increase the efficiency of addressing GitHub issues but also sets a substantive foundation for the exploration of AI's role in broader aspects of software maintenance and evolution.

Moreover, the introduction of a collaborative multi-agent system paves the way for future research into optimizing these interactions and further leveraging AI capabilities in software development processes. As LLMs continue to evolve, frameworks like MAGIS could become integral components of the software development lifecycle, augmenting human efforts with AI-driven insights and solutions.

In conclusion, MAGIS represents a significant stride toward harnessing the power of LLMs in software evolution, highlighting the immense potential that lies in the intersection of AI and software development. The journey forward is promising, with MAGIS providing a beacon for future endeavors in this evolving landscape.


