SWE-smith: Automated SE Data Pipeline

Updated 19 December 2025
  • SWE-smith is an automated pipeline that generates large-scale, bug-driven training data across Python repositories, enabling more robust LM-based software engineering research.
  • It employs diverse synthesis strategies—including LM-based modifications, procedural mutations, PR mirroring, and patch combinations—to ensure realistic and reproducible bug instances.
  • The system minimizes manual curation and storage overhead by automating environment setup, bug validation, and issue description synthesis within scalable, containerized workflows.

SWE-smith is an automated pipeline for generating large-scale, faithful software engineering training data tailored for language model (LM) agents. It targets the central bottleneck in data collection for software engineering benchmarks: previous datasets contain only thousands of instances drawn from a limited set of repositories and require laborious curation and extensive storage. SWE-smith enables scalable creation of bug-driven tasks across hundreds of Python repositories, complete with runnable environments and issue descriptions, thereby substantially expanding the scope and realism of available data for LM-based software engineering research (Yang et al., 30 Apr 2025).

1. System Overview and Motivation

Modern LLMs deliver strong performance on software engineering (SE) tasks, but open-source models substantially trail closed-weight systems in benchmarks such as SWE-bench Verified. Analysis identifies lack of large, diverse, and reproducible datasets as the primary cause: existing resources comprise thousands of instances from up to 11 GitHub repositories and require companion environments consuming terabytes of storage. SWE-smith introduces a fully automated process that, given any Python codebase, synthesizes hundreds to thousands of realistic bugs guaranteed to break at least one unit test, and generates corresponding issue descriptions. This facilitates downstream LM fine-tuning for automated SE tasks and lowers the entry barrier for agent-based systems research.

2. Pipeline Architecture and Workflow

The SWE-smith pipeline executes in four main stages:

  1. Repository Ingestion & Environment Construction: SWE-smith targets the top 5,000 PyPI Python projects (each with ≥ 1,000 GitHub stars, excluding SWE-bench originals). For each selected repository, SWE-agent (Claude 3.5 Sonnet) automates discovery of installation and test commands, proceeding for up to 100 steps until ≥80% of tests pass. Each (repo, commit) pair is captured in a single Docker image, optimizing environment storage.
  2. Bug Candidate Generation: Four scalable strategies are employed:
    • LM Generation:
      • LM-Modify prompts an LLM with the source of a working function or class (up to roughly 2,000 tokens) to introduce a logical bug (e.g., off-by-one errors, inverted conditions).
      • LM-Rewrite removes the body of a target function, then prompts the LLM to reimplement—frequently introducing subtle defects.
    • Procedural Modifications: Abstract syntax tree (AST) traversal applies 13 mutation operators (e.g., conditional removal, operator inversion) under filtering criteria to generate bug candidates.
    • PR Mirroring: Collects recent GitHub PR diffs (2023–), prompts an LLM to "undo" those changes, and injects them into the current codebase, simulating real bug-fix scenarios.
    • Patch Combination: Combines validated patches at file or module level (2–5 per module) for higher complexity bugs.
  3. Execution-Based Validation & Test Extraction: Each candidate modification is applied in the constructed environment and the full test suite is rerun (120 s timeout). If any previously passing test now fails, the instance is retained, and the sets of fail-to-pass ($T^-$) and pass-to-pass ($T^+$) tests are recorded (a simplified sketch of this mutation-and-validation loop appears after this list).
  4. Issue Description Synthesis: An LLM is prompted with the code delta, the code of a randomly selected failing test, and the test log; it generates a human-like GitHub issue containing a minimal reproduction snippet that triggers the failure, ensuring realistic and non-trivial issue text.
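
To make stages 2 and 3 concrete, below is a minimal Python sketch, under stated assumptions, of one procedural mutation (comparison inversion) followed by execution-based validation. The helper names (passing_tests, make_and_validate_bug), the pytest command line, and the single-file scope are illustrative choices, not the SWE-smith implementation, which applies its 13 operators inside the per-repository Docker images.

```python
# Illustrative sketch only (not the SWE-smith source): one procedural AST
# mutation operator followed by execution-based validation.
import ast
import subprocess
from pathlib import Path


class InvertComparisons(ast.NodeTransformer):
    """One possible mutation operator: flip ==/!= and </>= comparisons."""

    SWAP = {ast.Eq: ast.NotEq, ast.NotEq: ast.Eq, ast.Lt: ast.GtE, ast.GtE: ast.Lt}

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [self.SWAP.get(type(op), type(op))() for op in node.ops]
        return node


def passing_tests(repo_dir: Path) -> set:
    """Run the suite and return the ids of passing tests (crude summary parse)."""
    proc = subprocess.run(
        ["python", "-m", "pytest", "--tb=no", "-rA"],
        cwd=repo_dir, capture_output=True, text=True, timeout=120,
    )
    return {line.split()[1] for line in proc.stdout.splitlines()
            if line.startswith("PASSED")}


def make_and_validate_bug(repo_dir: Path, target_file: str):
    """Apply one mutation, rerun the tests, and keep the bug only if a test breaks."""
    src_path = repo_dir / target_file
    original_src = src_path.read_text()
    baseline = passing_tests(repo_dir)          # in practice computed once per repo

    tree = InvertComparisons().visit(ast.parse(original_src))
    ast.fix_missing_locations(tree)
    src_path.write_text(ast.unparse(tree))      # inject the candidate bug

    after = passing_tests(repo_dir)
    src_path.write_text(original_src)           # restore the clean tree

    fail_to_pass = baseline - after             # T^-: tests broken by the bug
    pass_to_pass = baseline & after             # T^+: tests still passing
    if not fail_to_pass:                        # must break at least one test
        return None
    return {"file": target_file,
            "T_minus": sorted(fail_to_pass),
            "T_plus": sorted(pass_to_pass)}
```

Restoring the original source after each candidate keeps the working tree clean, so many candidates can be validated against the same baseline set of passing tests.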

3. Dataset Composition and Characteristics

Application of SWE-smith yields a dataset of 50,137 instances sourced from 128 Python repositories, packaged in environments totaling 125 Docker images (295 GB). The average repository provides 381 instances (up to 2,277 for pandas-dev/pandas). Per-strategy yields are as follows:

| Strategy | % Candidates Valid | Instances Produced |
|---|---|---|
| LM-Modify | 55.9% | 17,887 |
| LM-Rewrite | 35.0% | 4,173 |
| Procedural Mods | 40.2% | 15,641 |
| PR Mirroring | 33.8% | 2,344 |
| Patch Combination | 96.9% | 10,092 |

Difficulty scores (easy=1, medium=5, hard=9) are assigned using a classifier trained on SWE-bench labels, yielding an average of ~6.0—comparable to prior benchmarks.

4. Model Training and Agent Interfaces

SWE-agent-LM-32B is built on the Qwen 2.5 Coder-Instruct model (32B parameters), pretrained on code and text with a context window of 32,768 tokens. Fine-tuning uses 5,016 expert trajectories (out of 6,457 attempts on 8,686 tasks, 36% resolve rate with Claude 3.7), capped at 3 per task to control sampling bias. Optimization uses full-parameter tuning via torchtune (learning rate $5 \times 10^{-5}$, 3 epochs). The agent outputs ReAct-style (thought, action) pairs; permitted actions are shell commands (bash(...)), code editing (str_replace_editor(...)), and result submission (submit()).
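
The trajectory-selection step described above (keep only resolved trajectories, capped at 3 per task) can be sketched in a few lines of Python; the field names ("task_id", "resolved") and the function name are hypothetical, not the actual SWE-smith data schema.

```python
# Hypothetical sketch of the rejection-sampling filter described above:
# keep only trajectories that resolved their task, at most `max_per_task`
# per task, to limit sampling bias toward easy, frequently-solved tasks.
from collections import defaultdict


def select_training_trajectories(trajectories, max_per_task=3):
    kept, per_task = [], defaultdict(int)
    for traj in trajectories:
        if not traj["resolved"]:                      # rejection sampling: drop failed attempts
            continue
        if per_task[traj["task_id"]] >= max_per_task:
            continue                                  # cap successful attempts per task
        per_task[traj["task_id"]] += 1
        kept.append(traj)
    return kept
```

Capping at three trajectories per task prevents the fine-tuning mix from being dominated by tasks the expert model solves on nearly every attempt.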

5. Evaluation Results and Ablations

Model performance is assessed on three splits: 5,016 SWE-smith trajectories (training), 300 SWE-bench Lite instances (validation), and 500 SWE-bench Verified instances (test). The key metric is Resolve Rate (Pass@1): the percentage of instances whose single final submission passes all original tests, with $k = 1$ (no verifiers, no temperature scaling). SWE-agent-LM-32B achieves 40.2% Pass@1 on SWE-bench Verified, setting the open-weight state of the art. Comparative results include:

| Model | Lite (%) | Verified (%) |
|---|---|---|
| Claude 3.7 Sonnet + SWE-agent | 48.0 | 58.2 |
| GPT-4o + SWE-agent | 32.0 | 38.8 |
| OpenHands, AutoCodeRover | 41.7 | 53.0 |
| SWE-fixer NoExec (72B) | 24.7 | 32.8 |
| R2E-Gym 32B | – | 34.4 |
| SWE-gym 32B | 15.3 | 20.6 |
| SWE-agent-LM-7B | 11.7 | 15.2 |
| SWE-agent-LM-32B | 30.7 | 40.2 |
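
For reference, the resolve-rate metric above can be written explicitly in the $T^-$/$T^+$ notation of Section 2; this is a restatement of the definition, with $\mathcal{D}$ introduced here to denote the evaluation split (e.g., the 500 SWE-bench Verified instances):

$$\text{Resolve Rate (\%)} \;=\; \frac{100}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \mathbf{1}\big[\text{the submitted patch for instance } i \text{ passes every test in } T^{-}_{i} \cup T^{+}_{i}\big]$$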

Ablation studies reveal that the PR Mirror and LM Rewrite strategies yield the highest Verified performance (~9%), followed by procedural modifications (~8.6%) and LM Modify (~5.7%). LLM-generated issue text performs on par with real GitHub issue text (7.7% vs. 7.8%) and outperforms fixed templates (6.4%). Expanding repository diversity at a fixed dataset size improves performance roughly logarithmically. Repository specialization (e.g., fine-tuning on 700 SymPy-specific tasks) confers a substantial in-domain gain (42.4% vs. 33.3%) with only a marginal drop elsewhere.

6. Practical Implementation and Resource Requirements

SWE-smith’s automated pipeline collects data at an operational cost of approximately $1,360 (major contributors: LLM-based bug synthesis and PR mirroring, $1,000; environment setup, $160; issue generation, $200 for 10,000 bugs), with a per-instance expense of $0.023. Environment storage requires 125 Docker images (295 GB); human intervention per repository is minimal (~5 minutes for verification). All collection scripts, Dockerfiles, prompts, assets, and documentation are open-sourced at https://swesmith.com. Extension involves forking the codebase, supplying repositories, generating bug candidates, validating them, collecting interaction trajectories, and fine-tuning LLMs via rejection-sampling fine-tuning (RFT).

7. Limitations and Prospective Directions

SWE-smith is currently Python-centric, since AST-based mutation libraries and validation pipelines are language-dependent. Extension to other programming languages would require analogous tooling. Evaluation leverages transparent test suites, but hidden test extraction (by omitting $T^-$ from environments) is feasible. Suggested advances include reinforcement learning (RL), in-context learning, inference-time verifiers, and temperature scaling to further refine training and evaluation. Open-sourcing code, environments, and data democratizes LM research in automated software engineering, lowering systemic barriers while offering reproducible experimentation (Yang et al., 30 Apr 2025).

References (1)

Yang et al. (30 Apr 2025). SWE-smith: Scaling Data for Software Engineering Agents.
