Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement (2411.00622v1)

Published 1 Nov 2024 in cs.SE and cs.AI

Abstract: Recent advancements in LLM-based agents have led to significant progress in automatic software engineering, particularly in software maintenance and evolution. Despite these encouraging advances, current research faces two major challenges. First, SOTA performance primarily depends on closed-source models, which significantly limits the technology's accessibility and potential for customization in diverse SE tasks. Second, these models are predominantly trained on static code data, lacking a deep understanding of the dynamic interactions, iterative problem-solving processes, and evolutionary characteristics inherent in software development. To address these challenges, our study adopts a software engineering perspective. We recognize that real-world software maintenance and evolution processes encompass not only static code data but also developers' thought processes, utilization of external tools, and the interaction between different functional personnel. Consequently, we introduce the Lingma SWE-GPT series, comprising Lingma SWE-GPT 7B and 72B. By learning from and simulating real-world code submission activities, Lingma SWE-GPT systematically incorporates the dynamic interactions and iterative problem-solving inherent in the software development process, thereby achieving a more comprehensive understanding of software improvement processes. We conducted experimental evaluations using the SWE-bench Verified benchmark. The results demonstrate that Lingma SWE-GPT 72B successfully resolves 30.20% of the GitHub issues, marking a significant improvement in automatic issue resolution (22.76% relative improvement compared to Llama 3.1 405B) and approaching the performance of closed-source models (GPT-4o resolves 31.80% of issues). Notably, Lingma SWE-GPT 7B resolves 18.20% of the issues, highlighting the potential for applying smaller models to ASE tasks.

An Evaluation of Lingma SWE-GPT for Automated Software Improvement

The paper presents Lingma SWE-GPT, a series of open-source LLMs for automated software improvement. The two models, Lingma SWE-GPT 72B and Lingma SWE-GPT 7B, represent a notable effort to match the performance of closed-source models such as GPT-4o and Claude 3.5 Sonnet while remaining accessible and alleviating the privacy concerns associated with closed systems.

Lingma SWE-GPT moves beyond reliance on static code data, addressing a key limitation of existing models. It adopts a more dynamic approach that mimics the real-world software engineering process, following a well-defined pipeline: repository understanding, fault localization, and patch generation (a minimal sketch of such a pipeline follows below). This design allows the model to capture the dynamic interactions and iterative processes inherent in software development, and could advance the real-world utility of LLMs in practical development scenarios where discerning complex project structures and generating context-sensitive solutions is critical.
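To make the three-stage workflow concrete, here is a minimal sketch of how such a pipeline could be orchestrated. The heuristics (keyword matching, stubbed model calls) and all function names are illustrative assumptions, not the authors' implementation, which drives each stage with the model itself.

```python
# Hypothetical sketch of the repository-understanding -> fault-localization
# -> patch-generation workflow; all heuristics here are placeholders.
import os

def understand_repository(repo_path: str, issue_text: str) -> list[str]:
    """Stage 1: rank source files by overlap with issue keywords."""
    keywords = {w.lower() for w in issue_text.split() if len(w) > 3}
    scored = []
    for root, _, files in os.walk(repo_path):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            text = open(path, errors="ignore").read().lower()
            score = sum(text.count(k) for k in keywords)
            if score:
                scored.append((score, path))
    return [p for _, p in sorted(scored, reverse=True)[:5]]

def localize_fault(candidate_files: list[str], issue_text: str) -> list[str]:
    """Stage 2: narrow candidates to suspect regions (stubbed)."""
    return candidate_files[:1]  # placeholder: keep the top-ranked file

def generate_patch(fault_sites: list[str], issue_text: str) -> str:
    """Stage 3: produce a patch for the localized fault (stubbed)."""
    return f"# a unified diff targeting {fault_sites} would be generated here"

def resolve_issue(repo_path: str, issue_text: str) -> str:
    files = understand_repository(repo_path, issue_text)
    sites = localize_fault(files, issue_text)
    return generate_patch(sites, issue_text)
```

In the paper's setting, each stage is performed by the language model itself through tool interactions rather than by the simple heuristics stubbed in above.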

The paper reports an extensive evaluation of Lingma SWE-GPT on the SWE-bench Verified and Lite benchmarks, comparing it against existing open-source and closed-source models. The results show that Lingma SWE-GPT 72B resolves 30.20% of GitHub issues, a 22.76% relative improvement in automatic issue resolution over Llama 3.1 405B, nearly equaling the performance of GPT-4o. This demonstrates strong potential for open-source models in practical automated software tasks, offering a more accessible counterpart to the currently dominant closed-source models.
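As a quick check on these figures, note that 22.76% is a relative gain, not an absolute one; the Llama 3.1 405B baseline implied by the two reported numbers can be recovered as follows (the baseline value is inferred, not quoted from the paper):

```python
# Back-of-the-envelope check of the reported relative improvement.
swe_gpt_72b = 30.20            # % of SWE-bench Verified issues resolved
relative_improvement = 0.2276  # reported 22.76% relative gain over Llama 3.1 405B

# new = base * (1 + relative), so the implied baseline is new / (1 + relative)
implied_baseline = swe_gpt_72b / (1 + relative_improvement)
print(f"implied Llama 3.1 405B rate: {implied_baseline:.2f}%")  # ~24.60%

# Relative improvement is (new - base) / base, distinct from the
# absolute gap of roughly 5.6 percentage points.
check = (swe_gpt_72b - implied_baseline) / implied_baseline
print(f"recovered relative gain: {check:.4f}")  # ~0.2276
```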

Additionally, the paper explores how smaller-scale models such as Lingma SWE-GPT 7B can produce competitive results. With a resolution rate of 18.20%, the 7B model surpasses the 17.20% achieved by Llama 3.1 70B, highlighting the utility and efficiency of smaller models in settings with constrained computational resources.

One of the pivotal contributions of the paper is the development-process-centric training strategy, which is key to the model's robust performance. This approach leverages the real-world dynamics of software processes, refining the models through curated synthesis of development data, including reasoning patterns, tool interactions, and practical problem resolutions. By incorporating a comprehensive curriculum training strategy, Lingma SWE-GPT shows improved ability to handle increasingly complex software tasks with higher reliability.
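This summary does not spell out the curriculum mechanics, but a generic curriculum schedule, ordering synthesized development trajectories from easy to hard across training phases, might look like the following sketch; the `difficulty` score and trajectory fields are assumptions, not the paper's data schema:

```python
# A generic curriculum-training sketch, not the authors' recipe: synthesized
# development trajectories are sorted by an assumed difficulty score and fed
# to the trainer in progressively harder phases.
from dataclasses import dataclass

@dataclass
class Trajectory:
    prompt: str         # issue description plus repository context
    actions: list[str]  # reasoning steps, tool calls, final patch
    difficulty: float   # assumed score, e.g. files touched or patch size

def curriculum_phases(data: list[Trajectory], n_phases: int = 3):
    """Yield successive training phases, each adding harder examples."""
    ordered = sorted(data, key=lambda t: t.difficulty)
    step = max(1, len(ordered) // n_phases)
    for phase in range(1, n_phases + 1):
        # Each phase trains on everything up to the current difficulty cap,
        # so easier skills keep being rehearsed as harder ones are added.
        yield ordered[: phase * step]

# Usage: for batch in curriculum_phases(trajectories): train_one_phase(batch)
```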

The implications of this research, both practical and theoretical, invite future exploration in several directions. Practically, Lingma SWE-GPT establishes a framework that democratizes access to high-performing automation tools for software improvement. Theoretically, the results underline the importance of a dynamic and process-oriented training paradigm to enhance the contextual understanding and execution abilities of LLMs in intricate, real-world applications. The authors speculate that further advancements could explore more sophisticated tool usage, reasoning, and verification capabilities, crucial for advancing AI-assisted software engineering into more extensive domains and broader stages of the software lifecycle.

In summary, this paper clearly defines a path towards accessible and efficient models for software engineering tasks, challenging the status quo dominated by closed-source models. The research lays a foundation for future inquiries into enhancing LLMs' comprehension and reasoning capabilities, with profound implications for the automation and quality enhancement of software engineering.

Authors (10)
  1. Yingwei Ma
  2. Rongyu Cao
  3. Yongchang Cao
  4. Yue Zhang
  5. Jue Chen
  6. Yibo Liu
  7. Yuchen Liu
  8. Binhua Li
  9. Fei Huang
  10. Yongbin Li