
SWE-smith: Scaling Data for Software Engineering Agents (2504.21798v2)

Published 30 Apr 2025 in cs.SE, cs.AI, and cs.CL

Abstract: Despite recent progress in language models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at https://swesmith.com.

Summary

Analysis of "SWE-smith: Scaling Data for Software Engineering Agents"

The paper introduces SWE-smith, a novel approach aimed at scaling data collection for training language models (LMs) that automate software engineering tasks. Although LMs for software engineering have advanced rapidly, the strongest systems rely on proprietary datasets, and the limited availability of training data hinders the development of open-source alternatives. SWE-smith addresses these issues by providing a scalable task-instance generation mechanism applicable to Python codebases.

Summary and Methodology

SWE-smith presents a pipeline that produces a dataset substantially larger than those of prior methods. The authors highlight several weaknesses of current data collection paradigms, such as the complex curation requirements and limited scalability of SWE-bench's approach. SWE-smith circumvents these through four automated strategies for creating bugs in codebases: LM-based function modification, procedural abstract syntax tree (AST) modification, recombination of bug patches, and pull request (PR) inversion.
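To make the procedural AST-modification strategy concrete, the sketch below shows one way a small, test-breaking bug could be injected into Python source by mutating comparison operators. This is an illustrative simplification using the standard ast module, not the paper's actual implementation; the transform and helper names are assumptions.

```python
# Illustrative sketch only: a simplified procedural AST mutation in the spirit
# of SWE-smith's strategy, not its actual implementation.
import ast


class FlipComparison(ast.NodeTransformer):
    """Swap comparison operators (e.g., == for !=), a small bug that existing tests can catch."""
    SWAP = {ast.Eq: ast.NotEq, ast.NotEq: ast.Eq,
            ast.Lt: ast.GtE, ast.GtE: ast.Lt}

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        self.generic_visit(node)
        node.ops = [self.SWAP.get(type(op), type(op))() for op in node.ops]
        return node


def inject_bug(source: str) -> str:
    """Parse a module, mutate its comparisons, and emit buggy source code."""
    tree = ast.parse(source)
    mutated = ast.fix_missing_locations(FlipComparison().visit(tree))
    return ast.unparse(mutated)  # requires Python 3.9+


if __name__ == "__main__":
    original = "def is_adult(age):\n    return age >= 18\n"
    print(inject_bug(original))  # prints a version with the comparison flipped
```

In practice, a pipeline like this would generate many candidate mutations per repository and keep only those that actually break existing tests.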

The pipeline enables the synthesis of hundreds to thousands of task instances from any Python codebase by constructing execution environments and generating realistic issue descriptions with LMs. By employing SWE-smith, the authors successfully generated a dataset of 50,000 instances from 128 GitHub repositories, an order of magnitude larger than that available in previous open-source work.
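The filtering step implied here, keeping only bugs that break existing tests, can be sketched as below. The function names and the reliance on pytest's exit code are illustrative assumptions rather than SWE-smith's actual API.

```python
# Hedged sketch of task-instance validation: a candidate bug is kept only if
# the test suite passes on the clean repository and fails after the bug patch,
# so the broken tests can later verify a candidate fix.
import subprocess


def run_tests(repo_dir: str) -> bool:
    """Return True if the repository's test suite passes in repo_dir."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-x", "-q"],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0


def is_valid_task_instance(clean_repo: str, patched_repo: str) -> bool:
    """Tests must pass before the synthetic bug and fail after it."""
    return run_tests(clean_repo) and not run_tests(patched_repo)
```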

Numerical Results and Model Performance

The SWE-agent-LM-32B model was developed using the SWE-smith framework, achieving a pass@1 resolve rate of 40.2% on the SWE-bench Verified benchmark, which is considered state of the art among open-source models. This represents a significant improvement in open-source modeling capabilities, demonstrating the efficacy of SWE-smith's scalable data generation. The models trained with SWE-smith showed consistent improvement with increased training instances, suggesting a positive correlation between dataset size and model performance.

Implications for Research

SWE-smith’s scalable infrastructure has potential implications for the future of LMs in software engineering. The approach facilitates the exploration of novel training techniques beyond fine-tuning, such as reinforcement learning, by providing a broader array of task instances with diverse difficulty levels. This diversity in training instances can be leveraged to test and improve model generalization across varied software domains, which is crucial for real-world applications. Furthermore, the reduction in human labor and storage requirements compared to existing methods suggests that SWE-smith will lower the entry barrier for academic and industrial research on LMs for automated software engineering, catalyzing further developments in open-source AI.

Future Directions

While promising, SWE-smith currently focuses on Python repositories due to the reliance on Python-specific libraries for AST manipulation. Future work could extend SWE-smith’s methodologies to other programming languages, enabling broader applicability across different software platforms. Additionally, the paper hints at the potential for specialization in repository-specific models, which could optimize performance for particular codebases, suggesting a direction for efficient personalization of AI agents.

Overall, SWE-smith provides a robust foundation for the next generation of software engineering agents, offering scalability and efficiency in training data collection that may significantly expedite the development of open-source, automated software engineering solutions. Its release as an open-source toolkit should stimulate further innovation and collaboration within the research community.
