
SWE-bench: Execution-Based Software Evaluation

Updated 26 October 2025
  • SWE-bench is an execution-based evaluation framework that tests language models on realistic software engineering tasks derived from actual GitHub issues and pull requests.
  • The benchmark requires generating patch files for multi-file, multi-step code modifications that fix failing tests while preserving overall system behavior.
  • It employs rigorous data filtering, execution validation, and advanced retrieval techniques to highlight current LM limitations in long-context and repository-scale reasoning.

SWE-bench is an execution-based evaluation framework for assessing the ability of language models (LMs) to resolve real-world software engineering problems derived from actual GitHub issues and pull requests. Each task in SWE-bench requires modifying a full codebase (often thousands of files and hundreds of thousands of lines) by generating a patch file that, when applied, changes the program's behavior so that previously failing tests pass while all other tests continue to pass. The benchmark is designed to push beyond traditional code generation settings: it demands reasoning across multiple functions, classes, and source files, together with the management of extremely long input contexts at repository scale. SWE-bench exposes critical limitations of current LMs, as leading models have resolved only a small fraction of these issues, underlining the need for advances in long-context modeling, multi-file reasoning, and robust automated testing (Jimenez et al., 2023).

1. Dataset Construction and Task Formulation

SWE-bench is built from 2,294 curated software engineering tasks, each extracted from one of 12 widely used open-source Python repositories. The data construction pipeline consists of three key filtering stages: (1) scraping all merged pull requests, (2) attribute-based filtering (retaining only those resolving a linked issue and affecting test files), and (3) execution-based validation by applying the patch to the base commit and verifying that at least one test transitions from failing to passing, while no previously passing tests regress. This ensures that each benchmark task is grounded in a realistic development scenario and is verifiable by test execution.
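
To make the filtering criteria concrete, the sketch below expresses stages (2) and (3) as predicates over already-collected pull-request records and per-test outcomes. The field names and the pass/fail encoding are illustrative assumptions, not the paper's actual collection scripts.

```python
# Minimal sketch of SWE-bench's attribute and execution filters.
# Field names ("merged", "linked_issues", "changed_files") and the
# per-test outcome encoding are illustrative, not the official pipeline.

def attribute_filter(pr: dict) -> bool:
    """Stage 2: keep merged PRs that resolve a linked issue and edit test files."""
    return (
        bool(pr["merged"])
        and bool(pr["linked_issues"])
        and any("test" in path for path in pr["changed_files"])
    )

def execution_filter(before: dict, after: dict) -> bool:
    """Stage 3: given per-test outcomes ('pass'/'fail') on the base commit
    before and after applying the gold patch, require at least one
    fail-to-pass transition and no pass-to-fail regressions."""
    fail_to_pass = [t for t, s in after.items()
                    if s == "pass" and before.get(t) == "fail"]
    regressions = [t for t, s in after.items()
                   if s != "pass" and before.get(t) == "pass"]
    return len(fail_to_pass) >= 1 and len(regressions) == 0

# Example: one new test flips to passing and nothing else breaks -> instance kept.
before = {"tests/test_io.py::test_new": "fail", "tests/test_io.py::test_old": "pass"}
after = {"tests/test_io.py::test_new": "pass", "tests/test_io.py::test_old": "pass"}
assert execution_filter(before, after)
```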

Each benchmark instance provides the following:

  • Context: An issue description (mean length ≈ 195 words) and a snapshot of a repository (potentially thousands of files).
  • Output: A patch (diff file) expected to be formatted in standard Unix diff notation, syntactically correct and directly applicable via tools like patch.
  • Evaluation metric: Binary instance-level metric; a solution is accepted if the patch applies cleanly and the required test transitions are observed when the test suite is run. Formally, letting $\mathcal{P}$ denote the proposed patch and $\mathcal{T}$ the union of fail-to-pass and pass-to-pass tests,

$$\mathbb{I}\Big\{\text{patch\_applies}(\mathcal{P}) \,\land\, \forall t \in \mathcal{T}: \text{run\_test}(\mathcal{P}, t) = \text{pass}\Big\} = 1$$

where the indicator function $\mathbb{I}\{\cdot\}$ formalizes binary task resolution.
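
In code, the metric reduces to a conjunction over the instance's designated test sets. The function below is a minimal sketch of that indicator; the fail-to-pass / pass-to-pass split mirrors the formula above, while the outcome encoding and argument names are assumptions rather than the official harness.

```python
def resolved(patch_applied: bool, outcomes: dict,
             fail_to_pass: list, pass_to_pass: list) -> bool:
    """Indicator from the formula above: 1 iff the patch applies and every
    test in the fail-to-pass and pass-to-pass sets passes after it is applied.
    `outcomes` maps test ids to 'pass'/'fail' (illustrative encoding)."""
    required = set(fail_to_pass) | set(pass_to_pass)
    return patch_applied and all(outcomes.get(t) == "pass" for t in required)
```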

A retrieval step is included to address the codebase’s prohibitive length: either a sparse retriever (BM25) or an "oracle" retriever (files directly referenced in the ground-truth patch) selects files for model input. Both approaches impose significant context size and noise challenges on LMs.
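
The sketch below shows what the sparse path can look like, using the open-source rank_bm25 package as a stand-in for the benchmark's retriever; the tokenization, file filtering, and absence of a context-length budget are simplifying assumptions.

```python
# Sparse (BM25) retrieval over repository files -- a sketch, not the
# benchmark's exact retriever. Requires: pip install rank_bm25
from pathlib import Path
from rank_bm25 import BM25Okapi

def retrieve_files(repo_root: str, issue_text: str, k: int = 20) -> list:
    """Rank Python files in the repository by BM25 score against the issue text."""
    paths = sorted(Path(repo_root).rglob("*.py"))
    docs = [p.read_text(errors="ignore") for p in paths]
    bm25 = BM25Okapi([d.split() for d in docs])       # naive whitespace tokenization
    scores = bm25.get_scores(issue_text.split())
    ranked = sorted(zip(scores, paths), key=lambda x: x[0], reverse=True)
    return [str(p) for _, p in ranked[:k]]            # top-k candidates fed to the LM
```

In practice the number of retrieved files is capped by the model's context budget, which is exactly the recall-versus-noise trade-off discussed in Section 3.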

2. Problem Spectrum and Required Capabilities

SWE-bench covers a broad range of real-world software engineering issues, such as:

  • Bug fixes that require precise localization of defects across a large codebase.
  • Feature implementations and enhancements that necessitate coordinated edits spanning several classes, functions, and modules.
  • Code style corrections, logical consistency changes (e.g., enforcing type conversion discipline), and modifications dependent on inter-module relationships.

The complexity of SWE-bench instances is characterized by ground truth patches that, on average, touch 1.7 files, 3 functions, and around 33 lines. This distribution necessitates reasoning across software architecture boundaries, not just local file-level edits. Many problems in SWE-bench involve subtle or multi-step dependencies where the practical fix is not straightforward, and the test coverage must robustly distinguish correct from incomplete or incorrect patches.
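
Per-patch statistics like these can be recovered directly from the gold diffs; the snippet below uses the third-party unidiff package as one way to count touched files, hunks, and edited lines (function-level counts require additional parsing and are omitted). It is illustrative, not the paper's analysis code.

```python
# Counting how many files, hunks, and lines a patch touches.
# Uses the third-party `unidiff` package; illustrative only.
from unidiff import PatchSet

def patch_stats(diff_text: str) -> dict:
    patch = PatchSet(diff_text)
    return {
        "files": len(patch),                                 # patched files
        "hunks": sum(len(f) for f in patch),                 # hunks per file, summed
        "lines_edited": sum(f.added + f.removed for f in patch),
    }
```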

3. Methodological Challenges for LLMs

SWE-bench is specifically designed to stress several axes of LM capabilities:

Long-context and Noise Management: Input prompts often reach tens or hundreds of thousands of tokens. Models must locate the critical subset of lines and files to edit (a "needle in a haystack" scenario). Notably, enlarging the retrieval window improves recall of relevant files but degrades end-to-end performance, as models become overloaded and distracted by spurious context.

Coordinated Multi-file Edits: Effective patching frequently requires synchronized changes to code in multiple files or modules. Successful models must manage diff headers, line numbers, and patch formats while correctly updating all interdependent locations.
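
For illustration, the block below embeds a small, hypothetical two-file patch in unified diff format and applies it with git apply; the file names, hunks, and line numbers are invented, but they show the header and hunk bookkeeping a model must keep consistent across interdependent edit sites.

```python
# A hypothetical two-file unified diff and a helper that applies it.
# File names, hunks, and line numbers are invented for illustration.
import subprocess

PATCH = """\
diff --git a/pkg/core.py b/pkg/core.py
--- a/pkg/core.py
+++ b/pkg/core.py
@@ -10,1 +10,1 @@
-    return int(value)
+    return int(float(value))
diff --git a/pkg/utils.py b/pkg/utils.py
--- a/pkg/utils.py
+++ b/pkg/utils.py
@@ -3,1 +3,1 @@
-    return convert(raw)
+    return convert(raw.strip())
"""

def apply_patch(repo_dir: str, patch_text: str) -> bool:
    """Return True if the patch applies cleanly to the checkout at repo_dir."""
    proc = subprocess.run(["git", "apply"], cwd=repo_dir, input=patch_text,
                          text=True, capture_output=True)
    return proc.returncode == 0
```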

Reasoning Beyond Syntax: Tasks may demand non-obvious semantic reasoning—for example, fixing subtle type conversion bugs affecting cross-module behavior or handling multi-case logic. Furthermore, even small code changes can have global side effects, and the model's solution must robustly generalize across all covered test cases.

Evaluation Results: Strong proprietary models and fine-tuned open-source LMs perform poorly. Claude 2 solves only 1.96% of SWE-bench tasks under BM25 retrieval; ChatGPT-3.5 achieves 0.17%; GPT-4, evaluated on a subset, resolves none. SWE-Llama (7B/13B, fine-tuned from CodeLlama-Python) achieves around 0.70% with BM25, sometimes more with oracle file selection. While 43–53% of patches generated by Claude 2 and SWE-Llama are syntactically valid, only a small fraction pass all required execution tests. The main limitations include greedy or trivial edits that miss the multi-file, multi-site nature of real fixes, failure to preserve existing behavior, and performance that degrades as input size grows.

4. Evaluation Pipeline and Benchmarking Protocol

SWE-bench’s evaluation protocol is deeply execution-centered: patch validity is not just syntactic but also functional, determined via test transitions. Generated patches are applied to the original codebase in a sandboxed environment, and the modified test suite is run to validate the fail-to-pass transitions and check for pass-to-pass regressions.
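
A minimal version of this loop can be expressed with a disposable container: mount the repository snapshot, apply the candidate patch, and run the instance's designated tests in isolation. The image name, mount layout, and install step below are assumptions; the official harness builds per-instance environments that this sketch does not reproduce.

```python
# Sandboxed evaluation sketch: apply a candidate patch and run the designated
# tests inside a throwaway container. Image, paths, and install step are
# assumptions; this is not the official SWE-bench harness.
import subprocess

def evaluate_in_sandbox(repo_dir: str, patch_file: str, tests: list,
                        image: str = "python:3.11") -> bool:
    cmd = (
        "cd /repo && git apply /tmp/candidate.patch && "
        "pip install -e . -q && python -m pytest -q " + " ".join(tests)
    )
    proc = subprocess.run(
        ["docker", "run", "--rm", "--network=none",
         "-v", f"{repo_dir}:/repo",
         "-v", f"{patch_file}:/tmp/candidate.patch:ro",
         image, "bash", "-lc", cmd],
        capture_output=True, text=True,
    )
    return proc.returncode == 0   # True only if every designated test passed
```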

Retrieval strategies play a significant role in input construction:

| Retrieval Mode | Methodology | Impact on LM Performance |
| --- | --- | --- |
| BM25 (sparse) | Keyword-based ranking of the top-N repository files | High noise, modest recall; performance drops as the retrieved context grows |
| Oracle | Files referenced directly by the gold patch | Smaller context and higher recall, but only feasible for evaluation |

The noise inherent in BM25 retrieval underscores the need for context-sensitive, efficient file selection and suggests why brute-force ingestion of large contexts remains ineffective.

5. Limitations, Research Impact, and Extensions

SWE-bench highlights the following methodological and practical limitations:

  • Mainstream and fine-tuned LMs are limited to single-site, shallow repairs, unable to replicate the nuanced, multi-file edits characteristic of competent human fixes.
  • Test suites and evaluation pipelines in the benchmark are susceptible to quality concerns such as incomplete test coverage or ambiguous issue descriptions if improperly constructed (this point has motivated follow-on work in SWE-bench+ and related iterations).
  • The static nature and limited repository coverage of the original SWE-bench (12 repositories) can lead to overfitting or distribution mismatch—a challenge addressed by later extensions such as SWE-bench-Live and SWEE-Bench.

Nonetheless, SWE-bench provides a rigorous, large-scale, and realistic testbed for assessing model practical utility in repository-scale software engineering—a milestone not previously attained by smaller, synthetic code generation benchmarks.

6. Future Directions and Open Challenges

SWE-bench research outlines several avenues for advancing the field:

  • Long-context modeling and retrieval: New architectures or retrieval methods are needed to efficiently extract and encode only those code fragments crucial for the issue at hand, rather than naively expanding the context window.
  • Execution-driven feedback and multi-step agents: Integrating patch and test execution in the loop allows iterative refinement of proposed solutions. Robust agentic workflows that repeatedly refine, verify, and backtrack on patches may outperform single-shot generation (see the sketch after this list).
  • Structured patch planning and evaluation: Beyond correct syntax and semantics, models must learn to strategically orchestrate patches spanning many files and architectural boundaries.
  • Benchmark evolution: More repositories and more challenging, contamination-resistant problem instances are required. Addressing test adequacy and solution leakage issues, as well as extension to additional languages, will make the benchmark continuously relevant for model development and assessment.
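
As a concrete illustration of the execution-in-the-loop idea from the second bullet, the sketch below wires patch generation, application, and test feedback into a bounded retry loop. The interfaces model.propose_patch, apply_patch, run_failing_tests, and revert are hypothetical, not an existing agent framework.

```python
# Hypothetical execution-in-the-loop repair agent (sketch only).
# model.propose_patch, apply_patch, run_failing_tests, and revert are
# placeholder interfaces, not an existing framework or the SWE-bench harness.

def repair_loop(model, instance: dict, repo_dir: str, max_iters: int = 5):
    """Iteratively propose, test, and refine a patch; return it once the
    instance's designated tests all pass, or None after max_iters attempts."""
    feedback = ""
    for _ in range(max_iters):
        patch = model.propose_patch(instance["issue"], repo_dir, feedback)
        if not apply_patch(repo_dir, patch):
            feedback = "The previous patch did not apply cleanly; regenerate it."
            continue
        failing = run_failing_tests(
            repo_dir, instance["FAIL_TO_PASS"] + instance["PASS_TO_PASS"])
        if not failing:
            return patch                      # all designated tests pass
        feedback = "These tests still fail:\n" + "\n".join(failing)
        revert(repo_dir)                      # backtrack before the next attempt
    return None
```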

In summary, SWE-bench represents a critical inflection point in the evaluation of machine intelligence for program repair, exposing both the promise and the limitations of current LMs as autonomous software engineers and motivating continued research into long-context code reasoning, multi-file edit planning, and execution-driven validation.

References

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770.