SWE-Perf Benchmark Overview

Updated 12 October 2025
  • SWE-Perf Benchmark is a framework for systematic evaluation of code optimization using real-world instances, reproducible Docker environments, and standardized metrics.
  • It integrates automated environment setup and continuous benchmarking to track performance regressions and support realistic repository-level assessment.
  • Empirical findings reveal that LLM-based agents underperform human experts, exposing scaling limitations and significant performance gaps in automated code improvements.

Performance benchmarking in software engineering quantifies and compares the efficiency of systems, frameworks, and code modifications using standardized metrics and reproducible evaluation methodologies. The SWE-Perf Benchmark is a suite of recent methodologies, datasets, and protocols developed for systematically evaluating code performance optimization and non-functional properties at the repository level, especially in the context of LLM-based agents and automated software improvement.

1. Conceptual Foundations and Motivation

Efficient and fair performance benchmarking has become central to both mathematical software assessment and to evaluating autonomous code improvement agents. Early frameworks, such as the performance profile (Hekmati et al., 2018), characterize software performance across a suite of problems by defining, for each solver $s$ and problem $p$, the performance ratio:

$$r_{p,s} = \frac{t_{p,s}}{\min\{\, t_{p,s'} : s' \in S \,\}}$$

with $t_{p,s}$ denoting execution time (or another performance measure). The cumulative function

$$\rho_s(\tau) = \frac{1}{n_p} \left| \left\{\, p \in P : r_{p,s} \leq \tau \,\right\} \right|$$

encodes the probability that solver $s$ is within a factor $\tau$ of optimal for any problem in $P$. However, assessment based solely on the best solver introduces ranking inconsistencies if the top solver is removed.

To address these inconsistencies, the nested performance profile extends the classical approach by iteratively removing the current best solver and recomputing $\rho_s(\tau)$, then aggregating these “waves” to yield an unbiased solver ranking. Such methods directly influence how modern SWE-Perf benchmarks design multi-candidate and multi-agent assessments (Hekmati et al., 2018).
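
A minimal sketch of the profile computation described above, using a synthetic runtime matrix; ranking the remaining solvers in each wave by the area under their profile is a simplifying assumption, not necessarily the aggregation used by Hekmati et al.:

```python
import numpy as np

def performance_profile(times: np.ndarray, taus: np.ndarray) -> np.ndarray:
    """times[p, s] = runtime of solver s on problem p.
    Returns rho[s, k] = fraction of problems on which solver s is within
    a factor taus[k] of the best solver for that problem."""
    ratios = times / times.min(axis=1, keepdims=True)  # r_{p,s}
    return np.array([(ratios <= tau).mean(axis=0) for tau in taus]).T

def nested_performance_profile(times: np.ndarray, taus: np.ndarray):
    """Iteratively drop the current best solver and recompute the profile
    ("waves"); the area-under-profile ranking here is a simplification."""
    remaining = list(range(times.shape[1]))
    waves = []
    while remaining:
        rho = performance_profile(times[:, remaining], taus)
        best = int(np.argmax(rho.sum(axis=1)))
        waves.append((remaining[best], rho[best]))
        remaining.pop(best)
    return waves

# 3 problems x 3 solvers, synthetic runtimes in seconds
times = np.array([[1.0, 1.2, 2.0],
                  [0.5, 0.4, 0.9],
                  [3.0, 2.5, 2.6]])
taus = np.linspace(1.0, 3.0, 21)
print(nested_performance_profile(times, taus))
```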

2. Benchmark Design and Components

SWE-Perf benchmarks are constructed from real-world codebases and rigorous curation pipelines targeting repository-level evaluation (He et al., 16 Jul 2025). Key design elements include:

  • Instance Construction: Tasks are derived from performance-improving pull requests identified by parsing commit histories from popular open-source repositories (e.g., over 100K PRs filtered down to 140 instances in SWE-Perf (He et al., 16 Jul 2025), or 102 tasks in GSO (Shetty et al., 29 May 2025)).
  • Machine-Executable Environments: Each instance is delivered in a reproducible Docker container, isolating all dependencies and ensuring hardware-level consistency. Environments are standardized (e.g., single CPU core, 16 GB RAM) (He et al., 16 Jul 2025).
  • Test Suite Curation: Rather than re-running all repository tests, performance-related or fail-to-pass unit tests are harvested to target the code segments touched by optimization patches. Task tuples include $(R, T, I, E, S^*, X^*)$—codebase, test suite, issue description, environment, reference test patch, and reference code patch (Vergopoulos et al., 10 Mar 2025); a schematic representation of such an instance is sketched after this list.
  • Gold-Standard Patches: Each task is associated with an expert-generated patch (from the developer’s PR), which both verifies that performance improvement is feasible and establishes a reference for comparative evaluation.
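
A hypothetical sketch of how such a task tuple might be represented in code; the field names and example values are illustrative, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class PerfTaskInstance:
    repo: str                    # R: codebase (repository URL or checkout)
    base_commit: str             #    pinned revision the patch applies to
    tests: list[str]             # T: curated performance-related tests
    issue: str                   # I: natural-language issue / PR description
    docker_image: str            # E: reproducible environment (e.g. 1 CPU core, 16 GB RAM)
    reference_test_patch: str    # S*: reference test patch from the developer PR
    reference_code_patch: str    # X*: reference (gold) code patch

example = PerfTaskInstance(
    repo="https://github.com/example/project",
    base_commit="abc1234",
    tests=["tests/test_hot_path.py::test_throughput"],
    issue="Reduce the latency of the hot path in module X.",
    docker_image="perf-bench/example:py3.11",
    reference_test_patch="diff --git a/tests/... (elided)",
    reference_code_patch="diff --git a/src/... (elided)",
)
```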

3. Evaluation Methodology and Metrics

Evaluation of code optimization agents in SWE-Perf hinges on rigorous multi-level metrics:

a) Patch Application (Apply Metric)

Assesses whether the model-generated patch integrates without conflict:

$$\text{Apply} = \frac{N_{\text{apply}}}{N_{\text{total}}}$$

b) Correctness Verification

Confirms all targeted tests pass post-patch:

$$\text{Correctness} = \frac{\#\left\{\, \text{cases where } \bigwedge_{j=1}^{N_i} \text{result}_{i,j}^{\text{post}} = \text{pass} \,\right\}}{N_{\text{total}}}$$
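
A toy illustration of how these two rates could be tallied from per-instance results; the record format is hypothetical, not the benchmark's actual output schema:

```python
def apply_rate(results: list[dict]) -> float:
    """Apply = N_apply / N_total: fraction of instances whose patch applied cleanly."""
    return sum(r["patch_applied"] for r in results) / len(results)

def correctness_rate(results: list[dict]) -> float:
    """Fraction of instances where the patch applied and every targeted test passes."""
    ok = sum(
        r["patch_applied"] and all(t == "pass" for t in r["post_patch_tests"])
        for r in results
    )
    return ok / len(results)

results = [
    {"patch_applied": True,  "post_patch_tests": ["pass", "pass"]},
    {"patch_applied": True,  "post_patch_tests": ["pass", "fail"]},
    {"patch_applied": False, "post_patch_tests": ["error"]},
]
print(apply_rate(results), correctness_rate(results))  # 0.666..., 0.333...
```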

c) Quantitative Performance Gain

Calculates performance improvement as the maximal statistically significant ($p < 0.1$ in a Mann–Whitney U test) runtime reduction $\delta$ across 20 repeated test executions, with outlier filtering by IQR. The gain $\delta$ is the maximum $x$ for which the patched runtimes remain significantly faster than the original runtimes scaled by $(1 - x)$.
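
A sketch of this computation under two assumptions the section does not pin down (outliers dropped at 1.5×IQR, one-sided test); function names and the scan granularity `step` are illustrative, and `scipy.stats.mannwhitneyu` supplies the test:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def iqr_filter(samples: np.ndarray) -> np.ndarray:
    """Drop measurements outside 1.5 * IQR of the inter-quartile range."""
    q1, q3 = np.percentile(samples, [25, 75])
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return samples[(samples >= lo) & (samples <= hi)]

def max_significant_gain(before: np.ndarray, after: np.ndarray,
                         alpha: float = 0.1, step: float = 0.001) -> float:
    """Largest x such that the patched runtimes are still significantly
    faster than the original runtimes scaled down by (1 - x)."""
    before, after = iqr_filter(before), iqr_filter(after)
    gain, x = 0.0, 0.0
    while x < 1.0:
        # H1: patched runtimes < (1 - x) * original runtimes
        _, p = mannwhitneyu(after, (1.0 - x) * before, alternative="less")
        if p >= alpha:
            break
        gain = x
        x += step
    return gain

# Simulated 20-run measurements (seconds): roughly 10% faster after the patch
rng = np.random.default_rng(0)
t_before = rng.normal(2.0, 0.05, 20)
t_after = rng.normal(1.8, 0.05, 20)
print(f"estimated gain: {max_significant_gain(t_before, t_after):.3f}")
```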

d) Oracle and Realistic Settings

  • Oracle/File-Level: The agent receives oracle knowledge of target functions, isolating optimization skill.
  • Repo-Level/Realistic: The agent must infer optimization targets from runtime traces, simulating practical scenarios.

The harmonic mean of per-test speedups is used in GSO (Shetty et al., 29 May 2025):

$$S(C_1, C_2) = \frac{n}{\sum_{i=1}^{n} T(C_2, i) / T(C_1, i)}$$

with success determined if both functional correctness holds and $S(C_h, C_a) \geq p$ relative to the human commit.
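
A direct transcription of this formula as a sketch; the per-test runtime arrays and the example values are stand-ins:

```python
import numpy as np

def harmonic_speedup(t_c1: np.ndarray, t_c2: np.ndarray) -> float:
    """S(C1, C2) = n / sum_i T(C2, i) / T(C1, i): the harmonic mean of the
    per-test speedups of codebase C2 relative to codebase C1."""
    ratios = t_c2 / t_c1
    return len(ratios) / ratios.sum()

# Per-test runtimes (seconds) for the human commit (C_h) and an agent patch (C_a)
t_human = np.array([1.0, 2.0, 0.5])
t_agent = np.array([1.1, 2.4, 0.6])
# Success would additionally require functional correctness and S(C_h, C_a) >= p
print(harmonic_speedup(t_human, t_agent))  # ~0.857: the agent is slower here
```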

4. Tools, Automation, and Continuous Integration

Modern SWE-Perf evaluations integrate automation and industry best practices for reproducibility and scalability:

  • Automated Environment Setup: Systems like SETUPAGENT (Vergopoulos et al., 10 Mar 2025), RepoLaunch (Zhang et al., 29 May 2025), and agentless LLM pipelines (Badertdinov et al., 26 May 2025) scan repository configuration files (README, Dockerfile, setup.py, CI/CD configs) to generate, test, and refine environment construction and installation instructions, minimizing manual curation and accommodating dynamic package drift (Zhang et al., 29 May 2025); a simplified sketch of this scan-and-verify loop follows this list.
  • Continuous Benchmarking: Strategies adapted from HPC frameworks automatically trigger performance benchmarks on each code commit, using time-series databases (InfluxDB), visualizations (Grafana), and robust archiving (Kadi4Mat) to track performance regressions and improvements over time (Alt et al., 3 Mar 2024).
  • Reproducibility Protocols: Each instance environment is containerized and versioned to ensure that all code, dependencies, and runtime measurement are faithfully reproduced at evaluation time.
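
A toy sketch of the scan-propose-verify idea behind such systems; this is not SETUPAGENT or RepoLaunch, and the file-to-command mapping plus the single verification step are simplifying assumptions:

```python
import subprocess
from pathlib import Path

# Well-known configuration files and a naive install command for each
CANDIDATE_COMMANDS = {
    "setup.py": "pip install -e .",
    "pyproject.toml": "pip install -e .",
    "requirements.txt": "pip install -r requirements.txt",
    "Dockerfile": "docker build -t repo-env .",
}

def propose_setup_commands(repo_root: str) -> list[str]:
    """Scan for known configuration files and propose install commands."""
    root = Path(repo_root)
    return [cmd for name, cmd in CANDIDATE_COMMANDS.items() if (root / name).exists()]

def verify_environment(repo_root: str, test_cmd: str = "pytest -x -q") -> bool:
    """Run the repository's test command to check the constructed environment.
    A real system would iterate on failures (e.g. pinning drifted package versions)."""
    proc = subprocess.run(test_cmd.split(), cwd=repo_root, capture_output=True)
    return proc.returncode == 0
```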

5. Empirical Findings and Capability Gaps

Comprehensive experimentation across SWE-Perf and analogous benchmarks has revealed critical insights:

  • Substantial Agent–Expert Performance Gap: Model-generated patches yield mean gains of only 0.4–1.7% compared to expert-derived improvements of ~10.85%. Even advanced agent-based or multi-step models (e.g., OpenHands) leave a persistent 8.59% gap relative to human-level optimization (He et al., 16 Jul 2025).
  • Scaling Limitations: Adding more agent computation (more interaction rounds or agent trajectories) yields modest relative improvements, rarely exceeding 10–20% success across even extended runs on the most challenging tasks (Shetty et al., 29 May 2025).
  • Specific Failure Modes: Leading LLM-based agents frequently:
    • Struggle to localize actual performance bottlenecks (often misoptimizing irrelevant code regions).
    • Apply trivial optimizations that do not generalize (modifying compiler flags or exploiting test input specifics).
    • Encounter severe difficulty with low-level languages (C, Cython), sometimes introducing segmentation faults (Shetty et al., 29 May 2025).
  • Distributional Shifts and Fairness: Legacy benchmarks (e.g., SWE-bench) are now recognized as prone to contamination and overfitting; new benchmarks such as SWE-bench-Live, SWEE-Bench, and SWE-rebench continuously ingest fresh examples and previously unseen repositories, resulting in a halved success rate for leading models compared to static evaluations (Zhang et al., 29 May 2025, Badertdinov et al., 26 May 2025).

6. Benchmark Evolution, Limitations, and Recommendations

The field has undergone significant maturation regarding dataset construction and evaluation methodology:

  • Contamination Risks: Studies have demonstrated that impressive performance on widely reused and static benchmarks (SWE-bench-Verified) may be inflated by memorization—models achieve high accuracy at file localization or function reproduction in ways correlating strongly with repository inclusion in pretraining data (Liang et al., 14 Jun 2025).
  • Broadening Coverage: Addressing these concerns, recent benchmarks enforce temporal and repository origin separation, enrich language coverage (Multi-SWE-bench: Java, TypeScript, JavaScript, Go, Rust, C, and C++ (Zan et al., 3 Apr 2025)), and emphasize performance, memory, and other non-functional metrics (Garg et al., 28 Sep 2025, Blot et al., 2022).
  • Representative and Challenging Tasks: Benchmarks are now curated to reflect increased code and fix complexity, multi-module optimizations, and diverse bug types, capturing memory management, concurrency, algorithmic, and system-level inefficiencies (Shetty et al., 29 May 2025, Garg et al., 28 Sep 2025).
  • Evaluation Protocols: Rigorous, standardized, and reproducible evaluation pipelines—agentless methods, open-source agentic scaffolds, execution-based verification, and meta-metric tracking (e.g., pass@k, outlier control)—are now common best practice.

7. Future Directions and Research Opportunities

Ongoing and future research directions for SWE-Perf benchmarking include:

  • Dataset Extension: Expanding the coverage to more codebases, languages, and broader categories of performance bugs and optimization targets, moving beyond file-level to holistic, repository-wide and multi-language scenarios (He et al., 16 Jul 2025, Shetty et al., 29 May 2025).
  • Integration of Learning Techniques: Incorporating cross-benchmark validation, reinforcement learning with large-scale, diverse RL training datasets (see Multi-SWE-RL (Zan et al., 3 Apr 2025)), and hybrid verifier architectures that combine execution-based and execution-free assessments (Jain et al., 9 Apr 2025).
  • Improved Agent Design: Developing agents capable of high-level reasoning, long-horizon planning, and robustly learning to generalize optimizations beyond input-specific or trivial patterns—addressing reward hacking, testing manipulation, and opportunistic exploits.
  • Holistic Optimization: Moving beyond execution time to include memory optimization, energy usage, and maintainability as simultaneous optimization targets in the evaluation, encouraging multi-objective autoimprovement and robust regression avoidance (Blot et al., 2022).
  • Benchmark Reproducibility and Sustainability: Ensuring continuous supply of fresh, uncontaminated test cases, leveraging agentless automation, and publishing both code and infrastructure to allow ongoing extension by the research community (Badertdinov et al., 26 May 2025).

In aggregate, the SWE-Perf Benchmark framework and its evolving methodologies underlie a new standard for evaluating automated code improvement, rigorously tracking the real-world impact of LLM-based engineering agents, and addressing both functional and non-functional software properties with scientific precision and transparency.
