SWE-bench lite: Automated Issue Benchmark
- SWE-bench lite is a curated benchmark of 300 self-contained Python issues designed to assess automated code repair and feature addition.
- It uses a single-submission evaluation protocol with binary pass/fail verification; CodeR, for example, reports a 28.33% issue resolution rate under this protocol.
- The benchmark emphasizes structured repair workflows, combining task graphs and classical fault localization to advance software engineering automation.
SWE-bench lite is a curated benchmark for evaluating automated software engineering systems, specifically targeting the task of resolving real-world GitHub issues. It has become a central resource for measuring the current capabilities and limitations of models and frameworks that aim to perform complex, end-to-end bug fixing and feature addition in open-source codebases.
1. Definition and Scope
SWE-bench lite is a subset of the broader SWE-bench benchmark, which was originally introduced to assess automated agents' abilities to resolve authentic Python-library issues found in open-source repositories. While the full SWE-bench suite includes 2,294 issues, the "lite" variant consists of 300 carefully selected issues. These issues were chosen for their higher quality and self-contained nature, which makes evaluation cycles faster and easier to interpret (Chen et al., 2024).
The scope of SWE-bench lite is limited to Python repositories. Each issue in the set includes a description derived from real GitHub reports, the relevant repository snapshot, and, when available, a minimal failing test or another clear signal of the broken or missing functionality.
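The components of a benchmark instance listed above can be pictured as a small record type. This is only an illustrative sketch: the class name and every field name are hypothetical and are not taken from the benchmark's actual data schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IssueInstance:
    """One benchmark task. All field names here are hypothetical."""

    instance_id: str                    # illustrative identifier for the task
    issue_text: str                     # description derived from the GitHub report
    repo_snapshot: str                  # reference to the repository state, e.g. a commit hash
    failing_test: Optional[str] = None  # minimal failing test, when available

    def has_reproduction(self) -> bool:
        """Return True when the instance ships with a failing test."""
        return self.failing_test is not None


# Made-up example instance, not drawn from the benchmark.
example = IssueInstance(
    instance_id="demo-001",
    issue_text="Calling parse('') raises IndexError instead of returning None.",
    repo_snapshot="abc1234",
)
```

Making the failing test optional mirrors the "when available" wording: a system may need to write its own reproduction when none is supplied.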
2. Evaluation Protocol and Metrics
Systems evaluated on SWE-bench lite are expected to autonomously apply code changes to a repository with the goal of resolving the described issue. The standard protocol mandates a single submission per issue for each system, mirroring a realistic software engineering setting where patch retries may be constrained (Chen et al., 2024).
The resolution of an issue is determined by an automated verification pipeline:
- The model generates or edits code and, if required, supplies a test or reproducing script.
- The verification system attempts to run both the project’s built-in tests and any reproduced failures to check for resolution.
- Success is reported as a binary outcome per issue.
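The binary outcome described in the steps above can be sketched as a single predicate over test results. The function name, the split into `target_tests` and `regression_tests`, and the rule that a missing result counts as a failure are all assumptions made for illustration; the real verification pipeline may differ.

```python
from typing import Mapping, Sequence


def is_resolved(
    results: Mapping[str, bool],
    target_tests: Sequence[str],
    regression_tests: Sequence[str],
) -> bool:
    """Return the pass/fail outcome for one issue.

    results maps a test name to whether it passed after the patch.
    target_tests are tests tied to the issue (hypothetical grouping).
    regression_tests are existing project tests that must keep passing.
    """
    required = list(target_tests) + list(regression_tests)
    if not required:
        # Assumption: with nothing to check, the issue cannot be verified.
        return False
    # A test with no recorded result is treated as failed.
    return all(results.get(name, False) for name in required)


outcome = is_resolved(
    {"test_parse_empty": True, "test_parse_basic": True},
    target_tests=["test_parse_empty"],
    regression_tests=["test_parse_basic"],
)
```

Because the protocol allows one submission per issue, this predicate would be evaluated once per issue, with no retry loop around it.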
The primary metric is the percentage of issues resolved. For instance, the CodeR system achieves a resolution rate of 28.33% (85 out of 300 issues), outperforming a range of prior approaches as well as commercial solutions such as Devin (13.86%) and Amazon Q (≈17–20%) (Chen et al., 2024).
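As a quick check of the arithmetic behind the headline figure, the metric is simply resolved issues divided by total issues. The helper below is a minimal sketch with a hypothetical name.

```python
def resolution_rate(resolved: int, total: int) -> float:
    """Return the percentage of issues resolved."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# Figures quoted above for CodeR: 85 of 300 issues.
coder_rate = resolution_rate(85, 300)
print(f"{coder_rate:.2f}%")  # prints "28.33%"
```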
3. Benchmark Design Principles
SWE-bench lite’s smaller scale stems from explicit curation goals:
- Quality Control: All selected issues are self-contained and avoid requiring complex, multi-repository context, ensuring that resolution depends on reasoning and code manipulation rather than large-scale context retrieval.
- Evaluation Efficiency: The dataset size permits rapid benchmarking and supports detailed ablation studies.
- Reproducibility: Every issue in the lite suite is fully specified and can be evaluated in isolated, deterministic, containerized environments (Chen et al., 2024).
A plausible implication is that performance on SWE-bench lite is a reasonable proxy for automated issue resolution in controlled, single-repository scenarios, though not necessarily for open-ended or cross-repository tasks.
4. Comparison with Related Benchmarks
SWE-bench lite is closely related to, but distinct from, both the original SWE-bench and other code understanding or generation benchmarks:
- Unlike traditional code synthesis tasks (e.g., HumanEval and MBPP), which focus on translating a function description to code, SWE-bench lite targets the holistic repair and evolution of mature, multi-file codebases.
- In contrast to benchmarks such as CodeRAG or RepoEval, which center on retrieval or single-turn completion, SWE-bench lite measures full-cycle issue resolution, including multi-step editing, test creation, and integration into the project’s test suite (Li et al., 2025).
5. Impact and Results in Recent Research
SWE-bench lite is used as a core evaluation for advanced automated code agents such as CodeR, a multi-agent, task-graph-driven framework. In CodeR’s evaluation, SWE-bench lite enables:
- Large-scale, head-to-head comparison with single-agent, RAG-based, commercial, and open-source systems.
- Fine-grained ablations of design choices such as the multi-agent design and explicit task-graph planning. The reported resolution rate is 22.0% for full CodeR, dropping to 10.0% without the multi-agent/task-graph design and to 14.0% without fault localization (Chen et al., 2024). These ablation figures differ from the headline 28.33% and are best read as relative comparisons within the ablation setting.
- Cost, efficiency, and resource comparisons (e.g., average API calls, token usage, and financial cost per issue resolved).
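The ablation figures quoted above can be restated as percentage-point drops from the full system. The snippet below only rearranges the numbers already given; the dictionary keys and the function name are illustrative labels.

```python
from typing import Dict

# Resolution rates (%) as quoted in the ablation discussion above.
ABLATION_RATES: Dict[str, float] = {
    "full CodeR": 22.0,
    "without multi-agent/task-graph": 10.0,
    "without fault localization": 14.0,
}


def drops_from_baseline(rates: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Return the percentage-point drop of each variant relative to the baseline."""
    base = rates[baseline]
    return {name: base - rate for name, rate in rates.items() if name != baseline}


drops = drops_from_baseline(ABLATION_RATES, "full CodeR")
# drops == {"without multi-agent/task-graph": 12.0, "without fault localization": 8.0}
```

Read this way, removing the multi-agent/task-graph design costs 12.0 points and removing fault localization costs 8.0 points in the reported ablation.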
The CodeR study suggests that combining classical fault localization, structured plan execution, and LLMs can substantially raise empirical resolution rates, highlighting the value of task decomposition and explicit planning over implicit, end-to-end LLM prompting (Chen et al., 2024).
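To make "classical fault localization" concrete, here is a minimal sketch of one well-known technique, spectrum-based fault localization with the Ochiai formula. It is chosen purely as an example of the family; the text above does not state which localization method any particular system uses, and the function name and input format are assumptions.

```python
import math
from typing import Dict, Set


def ochiai_scores(coverage: Dict[str, Set[str]], failing: Set[str]) -> Dict[str, float]:
    """Score code elements by suspiciousness using the Ochiai formula.

    coverage maps a test name to the set of code elements it executes.
    failing is the set of test names that failed.
    Score = ef / sqrt(total_failed * (ef + ep)), where ef and ep count the
    failing and passing tests that execute the element.
    """
    total_failed = len(failing)
    elements: Set[str] = set().union(*coverage.values()) if coverage else set()
    scores: Dict[str, float] = {}
    for element in elements:
        ef = sum(1 for t, cov in coverage.items() if t in failing and element in cov)
        ep = sum(1 for t, cov in coverage.items() if t not in failing and element in cov)
        denom = math.sqrt(total_failed * (ef + ep))
        scores[element] = ef / denom if denom else 0.0
    return scores


# Made-up coverage data: only the failing test touches "parse".
demo_coverage = {
    "t_fail": {"parse", "tokenize"},
    "t_pass_1": {"tokenize"},
    "t_pass_2": {"tokenize", "render"},
}
demo_scores = ochiai_scores(demo_coverage, failing={"t_fail"})
most_suspicious = max(demo_scores, key=demo_scores.get)
```

In this toy example `parse` ranks highest because it is executed only by the failing test, which is the kind of signal an editing step could use to narrow its search.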
6. Limitations and Future Directions
While SWE-bench lite offers a reproducible, high-quality platform for benchmarking, certain limitations are inherent:
- The repositories and issues are restricted to Python; cross-project and multi-language code changes are not covered.
- Many real-world issues that depend on external systems or broader context, or that involve non-code artifacts (e.g., data files, documentation), are out of scope.
- Its deterministic, single-submission protocol may obscure the benefits of interactive, iterative problem-solving frameworks.
Future directions for SWE-bench and related benchmarks include expanding coverage to additional languages, representing more diverse software maintenance scenarios, and integrating mechanisms for multi-turn human-agent collaboration and plan induction (Chen et al., 2024).
7. Significance in Software Engineering Automation
SWE-bench lite has emerged as a de facto standard for quantitative measurement in end-to-end software engineering automation. Its impact is reflected in the rapid iteration of frameworks tailored for the specific challenges it presents, such as:
- Multi-agent collaboration with explicit delegation of responsibilities (e.g., reproduction, fault localization, editing, verification).
- Use of pre-defined task graphs to enforce structured repair workflows and reduce LLM reasoning load.
- Systematic integration of classical software engineering techniques with advanced LLM-based generation and retrieval.
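The idea of a pre-defined task graph with delegated roles can be sketched with Python's standard `graphlib` module. The stage names follow the responsibilities listed above, but the graph shape, the stub agents, and the function names are hypothetical and do not describe any specific framework's implementation.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Hypothetical task graph: each stage maps to the stages it depends on.
TASK_GRAPH: Dict[str, Set[str]] = {
    "reproduce": set(),
    "localize_fault": {"reproduce"},
    "edit": {"localize_fault"},
    "verify": {"edit", "reproduce"},
}

Agent = Callable[[Dict[str, str]], str]


def execution_order(graph: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every stage runs after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def run_workflow(graph: Dict[str, Set[str]], agents: Dict[str, Agent]) -> Dict[str, str]:
    """Run each stage once, passing it the outputs of earlier stages."""
    outputs: Dict[str, str] = {}
    for stage in execution_order(graph):
        outputs[stage] = agents[stage](outputs)
    return outputs


# Stub agents standing in for LLM-backed components.
stub_agents: Dict[str, Agent] = {
    "reproduce": lambda out: "repro script",
    "localize_fault": lambda out: "suspects based on " + out["reproduce"],
    "edit": lambda out: "patch for " + out["localize_fault"],
    "verify": lambda out: "checked " + out["edit"],
}

order = execution_order(TASK_GRAPH)
workflow_outputs = run_workflow(TASK_GRAPH, stub_agents)
```

Fixing the graph ahead of time means the ordering is decided by the workflow rather than by the model at run time, which is one way to read "reduce LLM reasoning load".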
By focusing on difficult, self-contained, and practically relevant code issues, SWE-bench lite drives the development and validation of generalizable, fault-tolerant, and efficient automated agents for software maintenance and evolution (Chen et al., 2024; Li et al., 2025).