SWE-CI: CI Benchmark for Code Evolution

Updated 2 July 2026

SWE-CI is a benchmark that evaluates LLM-powered agents through multi-turn CI loops, emphasizing long-term maintainability over snapshot correctness.
It operationalizes real-world repository evolution with tasks spanning an average of 233 days, hundreds of commits, and substantial code changes.
Performance is measured via EvoScore, normalized change, and zero-regression rate, offering actionable insights into agents’ ability to manage technical debt.

SWE-CI is a repository-level benchmark designed to evaluate how well LLM-powered agents maintain and evolve codebases through a continuous integration (CI) loop, with a focus on long-term maintainability rather than snapshot functional correctness. Introduced to address fundamental limitations of traditional one-shot, static code benchmarks, SWE-CI operationalizes a realistic multi-turn workflow of repair, refactoring, and feature iteration over real-world repositories. Each task requires dozens of rounds of agentic analysis and code modification, tracing software evolution that spans months and hundreds of commits. It is presented as the first benchmark to systematically quantify sustained code quality and technical debt management by autonomous agents (Chen et al., 4 Mar 2026).

1. Formalization of the SWE-CI Evaluation Loop

SWE-CI defines an agentic evolution protocol over codebases $\mathcal{C}$, requirements $\mathcal{R}$, and a unit test suite $\mathcal{T}=\{t_1, \dots, t_{|\mathcal{T}|}\}$. Two oracles underpin the loop:

$\mathrm{require}_{\mathcal{T}}: \mathcal{C} \times \mathcal{C} \to \mathcal{R}$: Given the current codebase $c$ and a golden/reference codebase $c^*$, outputs a requirement $r$ describing the test-based delta.
$\mathrm{code}_{\mathcal{T}}: \mathcal{R} \times \mathcal{C} \to \mathcal{C}$: Applies requirement $r$ to $c$, yielding a new candidate $c'$.

Unlike static snapshot benchmarks, which resolve a single requirement in one step, SWE-CI instantiates a CI loop of length $N$:

$$r_i = \mathrm{require}_{\mathcal{T}}(c_i, c^*), \qquad c_{i+1} = \mathrm{code}_{\mathcal{T}}(r_i, c_i)$$

for $i = 0, 1, \dots, N-1$, halting when $c_i = c^*$ (modulo tests) or upon exhausting the iteration budget. Each benchmark task corresponds to a commit span $(c_0, c^*)$ of a real repository, with the agent incrementally "closing the gap" against moving targets as defined by the oracle functions (Chen et al., 4 Mar 2026).
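The loop described above can be sketched in Python. This is a toy model under stated assumptions, not the benchmark's harness: a codebase is represented only by the set of tests it passes, `require` returns the tests still missing, and `toy_agent` is a hypothetical stand-in for the agent that plays the role of the code oracle.

```python
from typing import Callable, FrozenSet, List

# Toy model (assumption): a codebase is the set of test names it passes,
# so two codebases are equal "modulo tests" when these sets match.
Codebase = FrozenSet[str]
Requirement = FrozenSet[str]


def require(current: Codebase, target: Codebase) -> Requirement:
    """Stand-in for the require oracle: tests the target passes but current does not."""
    return target - current


def ci_loop(
    start: Codebase,
    target: Codebase,
    code: Callable[[Requirement, Codebase], Codebase],
    max_iterations: int,
) -> List[Codebase]:
    """Run at most max_iterations rounds; return the trajectory [c_0, c_1, ...]."""
    trajectory = [start]
    current = start
    for _ in range(max_iterations):
        if current == target:  # gap closed (modulo tests): halt early
            break
        requirement = require(current, target)
        current = code(requirement, current)
        trajectory.append(current)
    return trajectory


def toy_agent(requirement: Requirement, current: Codebase) -> Codebase:
    """Hypothetical agent that fixes at most two failing tests per round."""
    fixed = sorted(requirement)[:2]
    return current | frozenset(fixed)


base = frozenset({"t1"})
golden = frozenset({"t1", "t2", "t3", "t4", "t5", "t6"})
history = ci_loop(base, golden, toy_agent, max_iterations=20)
print(len(history) - 1)  # number of CI iterations actually used: 3
```

Because the requirement is recomputed from the current codebase on every round, the agent always works against the remaining gap rather than a fixed, one-shot specification.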

2. Benchmark Construction and Dataset Characteristics

SWE-CI comprises 100 tasks drawn from 68 distinct Python repositories. Selection criteria ensure:

Mean evolution span: 233 days per task
Mean number of intermediate commits: 71
Minimum 1000 lines of code (LOC) changed per task (excluding test files)
Each repository: active maintenance $\geq$ 3 years, $\geq$ 500 GitHub stars, explicit lockfiles, nonrestrictive license, and comprehensive unit tests

The pipeline is as follows:

Repository Collection: 150,000 public projects are filtered to 4,923 by activity, stars, license, and tests.
Commit Span Extraction: Linearize histories, segment spans without dependency changes, and select those with a diff of $\geq$ 1,000 LOC (yielding 8,311 endpoint pairs).
Environment Construction: Automated Dockerization and test re-execution; missing dependencies are injected to recover environment reproducibility (yielding 1,458 viable pairs).
Filtering & Selection: Pairs are filtered for launchable tests and minimum test delta; tasks are then ranked by time span and commit count, with the top 100 chosen for the release set (Chen et al., 4 Mar 2026).
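The final filtering and ranking step can be sketched as follows. The record fields, the `min_test_delta` threshold, and the lexicographic ranking key (time span first, then commit count) are assumptions for illustration; the source does not state how the two ranking criteria are combined.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class CandidatePair:
    """Hypothetical record for one (base, target) commit pair."""
    repo: str
    span_days: int
    commits: int
    loc_changed: int    # non-test lines changed between the endpoints
    tests_launch: bool  # test suite starts in the rebuilt environment
    test_delta: int     # tests whose outcome differs between the endpoints


def select_tasks(
    pairs: Sequence[CandidatePair],
    top_k: int,
    min_loc: int = 1000,
    min_test_delta: int = 1,  # assumed threshold, not from the source
) -> List[CandidatePair]:
    """Filter out unusable pairs, then keep the top_k longest evolutions."""
    eligible = [
        p for p in pairs
        if p.tests_launch
        and p.loc_changed >= min_loc
        and p.test_delta >= min_test_delta
    ]
    # Assumed ranking: longer time span first, commit count as tie-breaker.
    ranked = sorted(eligible, key=lambda p: (p.span_days, p.commits), reverse=True)
    return ranked[:top_k]


pairs = [
    CandidatePair("a", 300, 90, 2500, True, 12),
    CandidatePair("b", 400, 50, 800, True, 20),   # too few changed lines
    CandidatePair("c", 150, 120, 4000, True, 7),
    CandidatePair("d", 500, 60, 3000, False, 9),  # tests do not launch
    CandidatePair("e", 300, 40, 1500, True, 3),
]
selected = select_tasks(pairs, top_k=2)
print([p.repo for p in selected])  # ['a', 'e']
```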

3. Evaluation Metrics for Maintainability and Evolution

SWE-CI measures an agent’s long-term software engineering competence via three quantitative metrics:

Normalized Change $a(c)$: Measures test progress or regression per iteration. With $n(c) = \sum_{t \in \mathcal{T}} \mathbb{1}$(test $t$ passes on $c$) counting the tests that pass on codebase $c$, and $c_0$ and $c^*$ denoting the base and reference codebases:

$$a(c) = \begin{cases} \dfrac{n(c) - n(c_0)}{n(c^*) - n(c_0)}, & n(c) \geq n(c_0) \\[2ex] \dfrac{n(c) - n(c_0)}{n(c_0)}, & n(c) < n(c_0) \end{cases}$$

$a(c) \in [-1, 1]$, yielding $a(c) > 0$ for progress, $a(c) < 0$ for regression.

EvoScore $e$: A future-weighted mean of the normalized change $a(c_i)$ across the $N$ CI iterations; for discount parameter $\gamma$:

$$e = \frac{\sum_{i=1}^{N} \gamma^{i}\, a(c_i)}{\sum_{i=1}^{N} \gamma^{i}}$$

$\gamma > 1$ biases toward later iterations, rewarding sustainable fixes.

Zero-Regression Rate: The fraction of tasks for which no regression occurs at any iteration; it quantifies an agent's ability to avoid regressions over long codebase evolution (Chen et al., 4 Mar 2026).
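A minimal Python sketch of the three metrics follows. The exact functional forms are assumptions made for illustration: normalized change scales progress by the gap between the base and reference pass counts and scales regressions by the base pass count; EvoScore uses weights $\gamma^i$ for iteration $i$; and a task counts as regression-free when its normalized change never drops below zero. All function names are hypothetical.

```python
from typing import Sequence


def normalized_change(n_current: int, n_base: int, n_target: int) -> float:
    """Assumed piecewise form: progress relative to the gap to close,
    regression relative to the number of initially passing tests."""
    if n_current >= n_base:
        gap = n_target - n_base
        return (n_current - n_base) / gap if gap > 0 else 0.0
    # n_current < n_base implies n_base > 0, so the division is safe.
    return (n_current - n_base) / n_base


def evo_score(changes: Sequence[float], gamma: float = 1.0) -> float:
    """Weighted mean of per-iteration changes with weights gamma**i, i = 1..N."""
    if not changes:
        raise ValueError("need at least one iteration")
    weights = [gamma ** i for i in range(1, len(changes) + 1)]
    return sum(w * a for w, a in zip(weights, changes)) / sum(weights)


def zero_regression_rate(trajectories: Sequence[Sequence[float]]) -> float:
    """Fraction of tasks whose normalized change is never negative (assumed criterion)."""
    if not trajectories:
        raise ValueError("need at least one task")
    clean = sum(1 for traj in trajectories if all(a >= 0 for a in traj))
    return clean / len(trajectories)


# Hypothetical task: 40 tests pass at the base commit, 60 at the reference.
n_base, n_target = 40, 60
passing_per_iteration = [45, 38, 60]
changes = [normalized_change(n, n_base, n_target) for n in passing_per_iteration]
print(changes)  # [0.25, -0.05, 1.0]
print(round(evo_score(changes, gamma=1.0), 3))
print(zero_regression_rate([changes, [0.1, 0.5, 1.0]]))  # 0.5
```

In this example the second iteration drops below the base pass count, so the task contributes a negative term to EvoScore and is excluded from the zero-regression count even though it ends with all tests passing.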

4. Comparison with Prior Static Code Benchmarks

SWE-CI introduces dimensions absent in prior static benchmarks such as HumanEval, MBPP, or SWE-bench:

Paradigm	Data Granularity	Capability Measured
Static (e.g. SWE-bench)	One-shot, single requirement	Functional correctness at snapshot
SWE-CI	Multi-turn, full repo, multi-commit	Long-term maintainability, technical debt management

SWE-CI tasks require sustained reasoning and long-horizon planning, with agents challenged to maintain or improve code quality iteratively. Unlike snapshot pass@1 metrics, SWE-CI explicitly evaluates whether models "stay ahead of tests" over realistic software lifecycles (Chen et al., 4 Mar 2026).

5. Experimental Results and Observations

Eighteen LLMs from eight providers were evaluated on SWE-CI using a dual-agent protocol (Architect + Programmer), with each task capped at 20 CI iterations. Key findings include:

EvoScore Trends: Models released post-2026, especially the Claude Opus series and GLM-5, consistently outperform earlier architectures.
$\gamma$-Sensitivity: Providers differ in how they weight short- versus long-term gains: GPT and MiniMax favor $\gamma > 1$ (late iterations), while GLM/Kimi favor $\gamma < 1$ (early iterations).
Zero-Regression Rates: Most agents achieve zero regressions in fewer than 25% of tasks; only select Claude-Opus variants exceed 50%.

Despite overall progress, state-of-the-art LLMs still exhibit significant difficulties with regression avoidance and long-horizon maintainability planning (Chen et al., 4 Mar 2026).
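The sensitivity to the discount parameter can be illustrated with a small sketch. The weighting scheme (weights $\gamma^i$ over iterations $i = 1, \dots, N$) and both trajectories are assumptions chosen for illustration, not results from the benchmark.

```python
def evo_score(changes, gamma):
    """Weighted mean of per-iteration normalized changes (assumed weights gamma**i)."""
    weights = [gamma ** i for i in range(1, len(changes) + 1)]
    return sum(w * a for w, a in zip(weights, changes)) / sum(weights)


# Hypothetical per-iteration normalized changes for two agents.
fast_then_stall = [0.6, 0.6, 0.6, 0.6]    # quick early gains, no further progress
slow_then_finish = [0.1, 0.3, 0.7, 1.0]   # slow start, closes the gap at the end

for gamma in (0.5, 1.0, 2.0):
    early = evo_score(fast_then_stall, gamma)
    late = evo_score(slow_then_finish, gamma)
    leader = "fast_then_stall" if early > late else "slow_then_finish"
    print(f"gamma={gamma}: {early:.3f} vs {late:.3f} -> {leader}")
```

With these inputs the early-gain trajectory scores higher for $\gamma = 0.5$ and $\gamma = 1$, while the late-finishing trajectory scores higher for $\gamma = 2$, which is how a ranking between agents can flip with the choice of $\gamma$.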

6. Representative SWE-CI Workflow Example

A typical SWE-CI scenario involves:

Base commit ($c_0$): 12 new failing tests, three regressions
Iteration 1: Architect proposes implementing a missing utility function and patching error handling; Programmer integrates code; eight new tests pass
Subsequent iterations: Architect identifies needed refactorings or test-driven repairs; Programmer implements; the intermediate normalized change $a(c_i)$ approaches 1 as the agent closes the test gap
Final iteration: All oracle tests pass, $a(c_N) = 1$, and an EvoScore with $\gamma > 1$ reflects the contribution of later, stabilization-focused changes (Chen et al., 4 Mar 2026).

7. Limitations, Challenges, and Future Directions

Identified challenges include:

Reproducibility: Docker environments for legacy repositories may break due to native dependencies or OS drift.
Language Scope: The current release is Python-only; extension to other ecosystems remains open.
Test Coverage Variance: Varying test suite quality in source repositories introduces bias, risking underestimation of regression rates.
Agent Protocol Complexity: The dual-agent (Architect/Programmer) setup captures realistic workflows but increases system complexity.

Future extensions are anticipated to address:

Multi-language CI benchmarking (Java, JS, C++)
Incorporation of richer CI events (linting, type-checking, security analysis)
Human-in-the-loop evaluation (code reviews)
Agents trained via reinforcement learning directly optimized for EvoScore (Chen et al., 4 Mar 2026).

8. SWE-CI and the Emergence of Repository-Centric Evaluation

The inadequacy of Task-Centric Learning (TCL) for small models has led to the development of Repository-Centric Learning (RCL) paradigms, which prioritize deep vertical mastery of a single codebase over broad task distribution. Work with models such as SWE-Spot-4B demonstrates that RCL fine-tuning results in agents with superior inference efficiency and sample efficiency within evolving codebases, making them especially apt for resource-constrained and privacy-sensitive settings typical of real-world CI/CD pipelines. For SWE-CI-style tasks, RCL enables small agents to maintain performance without the extensive online search that large frontier models require, and incremental RCL offers a tractable path for adaptive repo-expert deployment as codebases change (Peng et al., 29 Jan 2026).

References:

(Chen et al., 4 Mar 2026) "SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration"
(Peng et al., 29 Jan 2026) "SWE-Spot: Building Small Repo-Experts with Repository-Centric Learning"
