SWE-Dev Dataset: Agentic Software Benchmarks

Updated 2 June 2026

SWE-Dev is a collection of large-scale datasets that benchmark agentic software engineering through realistic, containerized task suites.
Each variant employs LLM-driven synthetic pipelines and reproducible environments to simulate integration, debugging, and feature-driven development challenges.
The datasets facilitate evaluation using metrics like Pass@k and Instruction Following Rate, driving advances in autonomous software development research.

The SWE-Dev dataset refers to several distinct large-scale, executable benchmarks in software engineering, most commonly denoting (1) a dev set associated with the APEX-SWE benchmark for measuring agentic software engineering capabilities (Kottamasu et al., 13 Jan 2026), (2) a platform focused on large-scale, annotated agent trajectories and synthesized test suites for Python bug-fixing (Wang et al., 9 Jun 2025), and (3) a dataset targeting autonomous feature-driven development evaluation at scale (Du et al., 22 May 2025). Each variant advances research on autonomous software development and evaluation of LLMs and multi-agent systems by constructing verifiable, real-world task suites and supporting reproducible, containerized execution.

1. Data Scope and Composition

The SWE-Dev datasets serve as benchmarks for diverse software engineering tasks, varying in design and primary objective:

A. APEX-SWE Dev Set ("SWE-Dev") (Kottamasu et al., 13 Jan 2026):

50 tasks: 25 integration (“greenfield” end-to-end system construction) and 25 observability (diagnose & fix production bugs).
Realistic contexts, with integration tasks requiring orchestration of services such as AWS LocalStack, EspoCRM, MailHog, Medusa, Zammad, Mattermost, and PostgreSQL.
Observability tasks are derived from production bug reports (GitHub issues→PRs, containerized environments, synthetic logs/dashboards).

B. SWE-Dev for Agent Training and Bugfix (Wang et al., 9 Jun 2025):

38,000 issue→patch instances from 4,413 public Python repositories (≥5 GitHub stars, ≥3 PRs).
Each instance: issue description, pre-patch source snippet, gold patch, test suite (synthesized and extracted), and 1–N agent trajectories.
4,630 executable unit-test files (2,097 synthesized “fail-to-pass” pytests; remainder extracted from PRs).
19,300 agent trajectories (multi-step ReAct/OpenHands rollouts).

C. SWE-Dev for Feature-Driven Development (Du et al., 22 May 2025):

14,000 training and 500 test instances (250 easy, 250 hard), Python-only.
Tasks from 1,086 PyPI packages: per-task averages of 64 source files (20,200 LOC), PRD length 1,833 tokens, ~200 LOC ground-truth edit, ~6 unit tests.
Each repo is containerized for strict reproducibility.

Dataset Variant	Instances/Tasks	Domain	Task Focus
APEX-SWE Dev Set (Kottamasu et al., 13 Jan 2026)	50	Mixed	Integration/Debug
SWE-Dev (Agent Bugfix) (Wang et al., 9 Jun 2025)	38,000	Python	Bug fixing
SWE-Dev (Feature Dev) (Du et al., 22 May 2025)	14,500 (14k+500)	Python	Feature addition

2. Task Design and Underlying Motivation

Integration Tasks in the APEX-SWE dev set simulate greenfield (from-scratch) construction of cloud-centric business workflows, emphasizing realistic interoperation across APIs and infrastructure-as-code. These are designed by experienced engineers based on common industry patterns, and stacked environments such as AWS emulation and enterprise SaaS products are provided for full-fidelity validation.

Observability Tasks simulate the complex, information-incomplete debugging flows that dominate production engineering, using logs, dashboards (Grafana/Loki), and fragments of chat/discussion for context. Each task is locked down via a full, containerized snapshot, including failing and gold-standard patches.

The SWE-Dev datasets for bug-fix and FDD are inherently grounded in practical, developer-authored source code and actual or synthesized test suites. The bug-fix corpus (Wang et al., 9 Jun 2025) centers on the issue→patch workflow, while FDD (Du et al., 22 May 2025) masks complete features (6 functions/task on average) and enforces verification through authentic unit tests.

3. Data Generation, Test Synthesis, and Trajectory Curation

The SWE-Dev corpus uses LLM-driven synthetic test pipelines for expanding coverage beyond the limitations of human-authored tests. The test-generation methodology (Wang et al., 9 Jun 2025) entails:

Extracting (i) bug-report text, (ii) faulty function snippet, (iii) import/dependency metadata.
Transforming issues into one or more Gherkin “Given–When–Then” scenarios via Llama-3.1-70B-Instruct.
Compiling scenarios into pytest functions using Qwen2.5-Coder-32B.
Executing synthesized tests to ensure they fail pre-patch and pass post-patch, discarding those not meeting both criteria.
Selecting tests by accuracy threshold: candidates with fail→pass accuracy ≥0.9 are kept, and lower-scoring candidates are weighted probabilistically via $P_{acc}$.
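
The execution filter and threshold selection in the steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `CandidateTest` record, its boolean outcome fields, and the reading of $P_{acc}$ as a keep-probability equal to the accuracy score are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class CandidateTest:
    """Hypothetical record for one synthesized pytest candidate."""
    name: str
    passes_pre_patch: bool   # outcome on the buggy (pre-patch) code
    passes_post_patch: bool  # outcome after applying the gold patch
    accuracy: float          # assumed fail->pass accuracy score in [0, 1]


def is_fail_to_pass(test: CandidateTest) -> bool:
    """True only for tests that fail before the patch and pass after it."""
    return (not test.passes_pre_patch) and test.passes_post_patch


def select_tests(candidates, threshold=0.9, rng=None):
    """Apply the execution filter, then the accuracy threshold.

    Candidates at or above the threshold are always kept. Lower-scoring
    ones are kept with probability equal to their accuracy, which is an
    assumed interpretation of the P_acc weighting.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    selected = []
    for test in candidates:
        if not is_fail_to_pass(test):
            continue  # discard tests that do not meet both criteria
        if test.accuracy >= threshold or rng.random() < test.accuracy:
            selected.append(test)
    return selected
```

Running the execution filter first means the accuracy score is only consulted for tests that already fail pre-patch and pass post-patch.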

Agent trajectories are generated with ReAct-style iterations under OpenHands, producing multi-action (exec, patch, etc.) rollouts that are used both for reward-weighted fine-tuning and as negative examples in offline RL protocols (KTO, OREO).

The feature-driven variant constructs environments per-task in Docker, verifying test-harness integrity before adding to the dataset. Masked features match PRD prompts, and only tasks for which all original tests pass on ground truth are included (Du et al., 22 May 2025).

4. Data Structures and Schema

APEX-SWE dev set (Kottamasu et al., 13 Jan 2026): Each task is a JSON object with fields:

task_id, task_type, prompt.md, context_files (multi-file), test_driver, test_patch, golden_patch.
Integration: Code/configuration snippets, working deployment passing side-effect unit tests.
Observability: Patch files (.diff) for regression-free fixes, synthetic logs, full environment specs.
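
A minimal sketch of checking that a task object carries the fields listed above. The field names come from the list; the value types, the helper names `missing_fields` and `parse_task`, and the choice to reject incomplete objects are assumptions for illustration.

```python
import json

# Field names as listed in the text; the value types are not specified there.
REQUIRED_FIELDS = (
    "task_id",
    "task_type",
    "prompt.md",
    "context_files",
    "test_driver",
    "test_patch",
    "golden_patch",
)


def missing_fields(task: dict) -> list:
    """Return the required field names that are absent from a task object."""
    return [name for name in REQUIRED_FIELDS if name not in task]


def parse_task(raw: str) -> dict:
    """Parse one task from JSON text and reject incomplete objects."""
    task = json.loads(raw)
    missing = missing_fields(task)
    if missing:
        raise ValueError(f"task is missing fields: {missing}")
    return task
```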

Bug-fix & Feature instances:

One JSON per issue/task, with fields for repo URL, issue or PRD text, source snippet, reference patch, associated tests, and cyclomatic complexity.
Trajectories: action sequence, final score, and metadata.
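
One way to model these records in code is shown below. This is a sketch under assumptions: the attribute names (`repo_url`, `task_text`, `final_score`, and so on) are hypothetical stand-ins for the fields described above, and the published files may name or nest them differently.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Trajectory:
    """One agent rollout: ordered actions, final score, free-form metadata."""
    actions: list
    final_score: float
    metadata: dict = field(default_factory=dict)


@dataclass
class Instance:
    """One bug-fix or feature task (attribute names are hypothetical)."""
    repo_url: str
    task_text: str              # issue description or PRD text
    source_snippet: str
    reference_patch: str
    tests: list
    cyclomatic_complexity: float
    trajectories: list = field(default_factory=list)


def instance_to_json(instance: Instance) -> str:
    """Serialize an instance, including nested trajectories, to JSON."""
    return json.dumps(asdict(instance))


def instance_from_json(raw: str) -> Instance:
    """Rebuild an instance and its trajectories from JSON text."""
    data = json.loads(raw)
    trajectories = [Trajectory(**t) for t in data.pop("trajectories", [])]
    return Instance(trajectories=trajectories, **data)
```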

All container environments, test execution traces, and reproduction instructions are provided so that results can be re-evaluated on demand.

5. Evaluation Protocols and Metrics

The principal evaluation metric for SWE-Dev variants is Pass@k: the empirical probability that at least one of $k$ sampled generations passes all executable correctness checks. It is computed as $\text{Pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$, where $n$ is the number of samples and $c$ is the number of passing generations. Aggregate Pass@1 and Pass@3 are reported.
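
The Pass@k formula translates directly into code. The sketch below is illustrative: the function names and the choice to aggregate by a plain mean over tasks are assumptions, not taken from any of the benchmark repositories.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Pass@k = 1 - C(n - c, k) / C(n, k) for one task.

    n: number of sampled generations, c: number that pass, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which gives Pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_task_counts, k: int) -> float:
    """Average Pass@k over tasks given (n, c) pairs (assumed aggregation)."""
    scores = [pass_at_k(n, c, k) for n, c in per_task_counts]
    return sum(scores) / len(scores)
```

For example, with $n = 10$ samples of which $c = 3$ pass, Pass@1 is $1 - 7/10 = 0.3$ and Pass@3 is $1 - 35/120 \approx 0.708$.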

Other critical metrics:

Instruction Following Rate (IFR) over filesets in FDD (Du et al., 22 May 2025).
RL reward as test-pass rate (fraction of test assertions satisfied).
Classical classification metrics (precision/recall/F1) for test-case “fail→pass” classification (Wang et al., 9 Jun 2025).
Cost and call counts for multi-agent system (MAS) benchmarking in FDD.
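
Two of the metrics above, the test-pass-rate reward and precision/recall/F1 for fail→pass classification, can be sketched as follows. The function names, the boolean input encoding, and the convention of returning 0.0 for empty denominators are assumptions chosen for the example.

```python
def pass_rate_reward(outcomes) -> float:
    """Fraction of checks satisfied, used as a scalar reward in [0, 1]."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0  # assumed convention when nothing was executed
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def precision_recall_f1(predicted, actual):
    """Binary precision, recall and F1 from parallel boolean sequences.

    predicted[i]: classifier says test i is a valid fail->pass test.
    actual[i]: test i really fails pre-patch and passes post-patch.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```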

All metrics are computed by executing developer-authored or LLM-synthesized test suites inside containers, so reported results can be reproduced.

6. Statistical Properties and Coverage

The SWE-Dev family yields a diverse, challenging dataset space:

Bug-fix corpus (Wang et al., 9 Jun 2025): 38k issues, mean input length ≈4,100 tokens (σ≈1,400), patches at μ=15 LOC, pre-patch complexity μ≈8.1.
FDD (Du et al., 22 May 2025): avg. 64 source files/repo, 20k LOC per sample, 6 functions/feature, ≈6 tests/sample.
APEX-SWE dev set (Kottamasu et al., 13 Jan 2026):
- Integration: stacks include LocalStack (56%), EspoCRM (35%), MailHog (33%), Mattermost (32%), Medusa (31%), Zammad (26%).
- Observability: language coverage — Go (30%), Python (25%), TypeScript (25%), Java (10%), C++ (10%); ~615 log lines/task.

Feature development and complex integration/observability set new frontiers for agentic LLM evaluation, revealing substantial unsolved challenges; for instance, state-of-the-art models achieve only 21–29% Pass@3 on hard FDD tasks (Du et al., 22 May 2025).

7. Licensing, Access, and Usage

All SWE-Dev variants are publicly released:

APEX-SWE dev set: Released under CC-BY 4.0, accessible via HuggingFace Datasets (mercor/APEX-SWE, split="dev"), with scripts/evaluation at https://github.com/Mercor-Intelligence/apex-evals (Kottamasu et al., 13 Jan 2026).
SWE-Dev for bug-fix: Apache 2.0 license, hosted at https://github.com/THUDM/SWE-Dev (Wang et al., 9 Jun 2025).
SWE-Dev for feature development: Available at https://github.com/justLittleWhite/SWE-Dev (Du et al., 22 May 2025).
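
A minimal loading sketch for the APEX-SWE dev set, assuming the Hugging Face `datasets` library is installed and that the repository id and split named above load without an extra configuration name. The helper name `load_apex_swe_dev` is hypothetical, and the column layout of the result is not assumed here.

```python
REPO_ID = "mercor/APEX-SWE"  # repository id as given in the text
SPLIT = "dev"                # split name as given in the text


def load_apex_swe_dev():
    """Download (or reuse the local cache of) the dev split."""
    # Imported lazily so this module can be read without the dependency.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=SPLIT)
```

Calling `load_apex_swe_dev()` requires network access on first use. Inspecting `len(...)` and `.column_names` on the returned object is a reasonable first check before relying on any particular field.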

All datasets provide Docker configurations and code for test execution, enabling direct reproducibility. Python code snippets for installation, loading, and batch evaluation are included in documentation. Dataset schemas are published with executable environments and include sample notebooks for both inference and training.

The SWE-Dev suite thus constitutes a foundational resource for advancing empirical research on LLM-based software engineering, providing data at scale, strict verification regimes, and environments for reliable, fine-grained benchmarking (Kottamasu et al., 13 Jan 2026, Wang et al., 9 Jun 2025, Du et al., 22 May 2025).


References

APEX-SWE (2026)

SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling (2025)

SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development (2025)
