
SWE-bench-Live Benchmark Suite

Updated 31 December 2025
  • SWE-bench-Live is a continuously updated benchmark suite that automates task generation and evaluation for LLMs on real-world software tasks.
  • Its architecture employs Docker images for reproducibility, automated issue-PR crawling, and realistic query mutation to mirror developer challenges.
  • Evaluation metrics such as resolved, apply, and localization rates reveal significant performance drops compared to static benchmarks.

SWE-bench-Live is a continuously updated, contamination-resistant benchmark suite for evaluating LLMs and agentic workflows on real-world software engineering tasks. The framework advances beyond static benchmarks such as SWE-bench by automating task generation, environment setup, and evaluation using fresh, post-2024 GitHub issues, enabling rigorous assessment of agent capabilities under real, evolving conditions (Zhang et al., 29 May 2025). Its architecture incorporates monthly instance additions, expansive repository coverage, and instance-level reproducibility, providing granular insight into model performance and generalization in dynamic software contexts.

1. Motivation and Benchmark Evolution

Traditional code-fixing benchmarks like SWE-bench face three core limitations: staleness due to infrequent updates, narrow repository coverage increasing susceptibility to memorization, and substantial manual overhead in task curation and environment management (Jimenez et al., 2023). SWE-bench-Live resolves these by (a) ingesting only issues filed after Jan 1, 2024, (b) expanding coverage to 93 repositories (from 2,609 candidate projects), and (c) implementing a fully automated curation and validation pipeline ("RepoLaunch") (Zhang et al., 29 May 2025, Adamenko et al., 15 Jul 2025). Tasks are coupled with dedicated Docker images to enforce reproducible execution and robust contamination filtering.

2. Automated Task Pipeline and Architecture

The core SWE-bench-Live construction pipeline ("RepoLaunch") comprises three tightly orchestrated stages:

  • Issue–PR Crawling: Select active Python repositories (≥ 1,000 stars, appropriate license) via GitHub API; extract recent issues (post-2024) and link them to corresponding merged PRs using improved heuristics over SWE-bench's string matching (Zhang et al., 29 May 2025, Adamenko et al., 15 Jul 2025).
  • Instance Generation and Validation: For each linked pair, reset the repository to the base_commit, apply the ground-truth patch, and verify transition of failing tests (FAIL_TO_PASS) to passing without inducing regressions (PASS_TO_PASS). Only instances showing deterministic, reproducible behavior are retained.
  • Automated Environment Setup: Two LLM agents (Setup and Verification) operate within a container, iteratively installing dependencies, debugging install/test failures, choosing the appropriate test command, and finalizing the runtime snapshot as a Docker image—constrained by a time-machine proxy to ensure all dependencies match the patch's historical context (Zhang et al., 29 May 2025, Adamenko et al., 15 Jul 2025).
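The validation stage above can be sketched in a few lines of Python. This is an illustrative simplification, not the actual RepoLaunch API: the function and parameter names (`validate_instance`, `run_tests`) are hypothetical, and the real pipeline drives git checkouts and test runners inside a container.

```python
# Illustrative sketch of the FAIL_TO_PASS / PASS_TO_PASS validation logic.
# Names are hypothetical; the real pipeline runs inside a Docker container.

def validate_instance(run_tests, fail_to_pass, pass_to_pass):
    """Accept an instance only if the ground-truth patch flips every
    FAIL_TO_PASS test to passing without breaking any PASS_TO_PASS test.

    run_tests(with_patch=bool) -> dict mapping test id to True (pass) / False.
    """
    before = run_tests(with_patch=False)  # repo reset to base_commit
    after = run_tests(with_patch=True)    # ground-truth patch applied

    # FAIL_TO_PASS tests must fail on the base commit and pass once patched.
    flips = all(not before[t] and after[t] for t in fail_to_pass)
    # PASS_TO_PASS tests must pass in both states (no regressions).
    stable = all(before[t] and after[t] for t in pass_to_pass)
    return flips and stable
```

Only instances for which this check succeeds deterministically across repeated runs are retained in the benchmark.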

Each SWE-bench-Live entry comprises:

| Field | Description | Notes |
| --- | --- | --- |
| base_commit | Git SHA for the code snapshot | Ensures temporal contamination control |
| patch / test_patch | Code/test changes in .diff format | Applied and verified per instance |
| problem_statement | Original GitHub issue text | May be further mutated for realism |
| FAIL_TO_PASS / PASS_TO_PASS | Lists of test transitions | Used for correctness verification |
| image_key | Docker image snapshot | Ensures instance reproducibility |
| test_cmds / log_parser | Per-task test instructions | Heuristic or agent-derived |
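The schema above can be modeled as a small Python record. This is a minimal sketch assuming the field names from the table; the types and defaults are assumptions, not the official dataset schema.

```python
# Sketch of a SWE-bench-Live instance record; field names follow the table
# above, but types and defaults are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class LiveInstance:
    base_commit: str                     # git SHA pinning the code snapshot
    patch: str                           # ground-truth fix, .diff format
    test_patch: str                      # accompanying test changes, .diff format
    problem_statement: str               # original (possibly mutated) issue text
    fail_to_pass: list[str] = field(default_factory=list)
    pass_to_pass: list[str] = field(default_factory=list)
    image_key: str = ""                  # Docker image tag for reproducibility
    test_cmds: list[str] = field(default_factory=list)
    log_parser: str = ""                 # heuristic or agent-derived parser id
```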

3. Evaluation Methodology and Metrics

Benchmarking on SWE-bench-Live distinguishes itself by rigorous contamination reduction and discriminative metric design. Tasks from repositories used in SWE-bench are directly compared with those from novel sources; performance is further stratified by issue recency, patch scope (single vs. multi-file, line count), and repository scale (> 500 files) (Zhang et al., 29 May 2025, Adamenko et al., 15 Jul 2025, Garg et al., 10 Oct 2025). Key metrics include:

  • Resolved Rate: $\mathrm{success\_rate}_{\mathrm{Live}} = \dfrac{\#\,\text{tasks solved on Live}}{\text{total tasks}}$
  • Apply Rate: Fraction of model-generated patches that cleanly apply
  • Localization Rate: Fraction where the agent selects the correct files to edit
  • Pass@k: Fraction of tasks solved within k sampled attempts; binomial confidence intervals are used for stability analysis (Adamenko et al., 15 Jul 2025)
  • Overestimation Delta: $\Delta_{SR}(M) = SR_{\mathrm{baseline}}(M) - SR_{\mathrm{live}}(M)$, quantifying the performance drop when switching from formal to realistic queries (Garg et al., 10 Oct 2025)
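The per-run rates above can be computed directly from per-task result records. A minimal sketch follows; the record keys (`resolved`, `patch_applied`, `correct_files`) are illustrative names, not a published schema.

```python
# Sketch of metric aggregation over per-task result records.
# Record keys are illustrative assumptions, not the benchmark's schema.

def summarize(results):
    """Compute resolved, apply, and localization rates over a run."""
    n = len(results)
    return {
        "resolved_rate": sum(r["resolved"] for r in results) / n,
        "apply_rate": sum(r["patch_applied"] for r in results) / n,
        "localization_rate": sum(r["correct_files"] for r in results) / n,
    }

def overestimation_delta(sr_baseline, sr_live):
    """Delta_SR(M) = SR_baseline(M) - SR_live(M)."""
    return sr_baseline - sr_live
```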

Notably, in the inaugural release (1,319 instances), even top agent/model pairs (e.g., OpenHands / Claude 3.7 Sonnet) achieve only ≈ 19–22% resolved rates, less than half of their static SWE-bench performance, with success on multi-file or large fixes dropping below 10% (Zhang et al., 29 May 2025). The table below illustrates this effect:

| Agent / Model | Resolved on Live (%) | SWE-bench Verified (%) |
| --- | --- | --- |
| OpenHands / Claude 3.7 Sonnet | 19.25 | 43.20 |
| SWE-Agent / GPT-4.1 | 18.57 | > 40 |

4. Realistic Query Mutation and Robust Agent Evaluation

Empirical analysis of IDE-embedded chat telemetry demonstrates formal benchmarks dramatically overstate agent bug-fixing capabilities (Garg et al., 10 Oct 2025). SWE-bench-Live incorporates a mutation framework that rewrites issue statements into short, developer-style queries using 11 extracted templates (e.g., “Paste Error/Stack Trace Only,” “Direct 'Fix this'”), selected statistically to match real-world asking patterns. Mutated instances reveal:

  • Performance Drop: GPT-4.1 shows a 36.5% relative decrease in resolved rate (SR_base = 35.6%, SR_live = 22.6%)
  • Robustness Demand: Agents must retrieve additional context, ask clarifying questions, and handle ambiguous/missing information for reliable resolution
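The mutation step can be sketched as a template dispatch. The two template names below appear in the text; everything else here, including the selection logic and the issue-record keys, is a simplification of the statistically weighted template choice the paper describes.

```python
# Illustrative mutation of a formal issue into a terse developer-style query.
# Template names come from the text; the dispatch and issue keys are assumed.
import random

TEMPLATES = {
    "stack_trace_only": lambda issue: issue["traceback"],     # paste trace only
    "direct_fix_this": lambda issue: f"Fix this: {issue['title']}",
}

def mutate(issue, template_name=None, rng=random):
    """Rewrite a full issue into a short query using one of the templates."""
    name = template_name or rng.choice(sorted(TEMPLATES))
    return TEMPLATES[name](issue)
```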

This design paradigm aligns SWE-bench-Live with realistic agent use, reducing overfitting and promoting genuine generalizability.

5. Comparison to Other Dynamic Benchmarks and Quality Assurance

SWE-bench-Live shares core principles with dynamic benchmarks such as SWE-MERA, featuring monthly automated task harvesting, leak detection via n-gram overlap, and aggressive test quality validation using LLM scoring pipelines (Adamenko et al., 15 Jul 2025). SWE-MERA reports pass@6 rates of ≲40% for state-of-the-art models—demonstrating substantial room for advancement before saturation:

| Model | Static SWE-bench Pass@6 | SWE-MERA Pass@6 |
| --- | --- | --- |
| DeepSeek-R1 | 58.0% | 40.2% |
| Qwen3-32B | 35.0% | 26.1% |
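Pass@k figures such as those above are commonly computed with the standard unbiased estimator from the Codex evaluation literature; a sketch follows, though SWE-MERA's exact aggregation may differ.

```python
# Standard unbiased pass@k estimator: probability that at least one of k
# draws from n samples (of which c are correct) solves the task.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```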

Both platforms advocate for rolling evaluation windows, frequent contamination audits, Docker/Conda validation, and public transparency via leaderboards and GitHub submission interfaces.

6. Practical Implementation and Extensibility

SWE-bench-Live is deployable via Python API or CLI, with configuration for repository/issue selection, model backend, retriever type (BM25, oracle), context window, and timeouts. The system supports extension to additional programming languages (Java, Go, JS), custom test frameworks, neural retrievers, and integration into CI/CD pipelines (Jimenez et al., 2023). Guidelines for sustainability include regular pipeline automation, community-contributed task sources, and public release of evaluation/logging scripts (Adamenko et al., 15 Jul 2025).
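The BM25 retriever option mentioned above ranks repository files by lexical relevance to the issue text. Below is a toy scorer illustrating the idea, not the benchmark's actual retriever implementation; the function name and tokenization are assumptions.

```python
# Toy BM25 ranking of repository files against an issue query.
# Illustrative only; the benchmark's retriever implementation may differ.
from collections import Counter
from math import log

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank doc ids in `docs` (id -> text) by BM25 relevance to `query`."""
    toks = {d: text.lower().split() for d, text in docs.items()}
    avgdl = sum(len(t) for t in toks.values()) / len(toks)
    n = len(docs)
    df = Counter(w for t in toks.values() for w in set(t))  # document freq

    def score(d):
        tf = Counter(toks[d])
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = log((n - df[w] + 0.5) / (df[w] + 0.5) + 1)
            norm = tf[w] + k1 * (1 - b + b * len(toks[d]) / avgdl)
            s += idf * tf[w] * (k1 + 1) / norm
        return s

    return sorted(docs, key=score, reverse=True)
```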

7. Impact, Limitations, and Future Directions

SWE-bench-Live sets a new standard for realistic, contamination-resistant agent and LLM evaluation in software engineering. Its monthly updates and scalable validation allow for benchmarking in line with live open-source development. Limitations include infrastructure maintenance overhead, selection bias toward popular repositories, and potential for automated grading to miss nuanced alternate solutions (Adamenko et al., 15 Jul 2025, Zhang et al., 29 May 2025). Future enhancements target:

  • Expanded language/domain support
  • Advanced contamination metrics
  • Direct integration with agent improvement pipelines
  • Dynamic leaderboards and transparent metrics for community engagement

This approach positions SWE-bench-Live as a critical asset for model selection, benchmarking, and continual agent evolution in practical, production-aligned engineering workflows.
