SWE-CI: CI Benchmark for Code Maintainability
- SWE-CI is a benchmark that rigorously evaluates LLM-powered agents in dynamic continuous integration workflows by simulating multi-round software maintenance based on real open-source evolution.
- It employs a dual-agent architecture where an Architect generates incremental requirements and a Programmer applies iterative code updates, emphasizing long-term maintainability.
- Key metrics such as Normalized Change, EvoScore, and Zero-Regression Rate quantify performance improvements and highlight challenges in sustaining code quality over extended CI cycles.
SWE-CI is a benchmark designed to rigorously evaluate the capabilities of LLM-powered agents in maintaining software codebases within continuous integration (CI) workflows. Unlike prior benchmarks that assess functional correctness in static, one-shot repair scenarios, SWE-CI introduces a dynamic, multi-round evaluation based on realistic evolution histories from open-source repositories. This framework emphasizes long-term maintainability, requiring agents to resolve sequences of test failures, propose incremental requirements, and sustain code quality over dozens of simulated integration cycles. The benchmark delivers new insights into the technical limitations of current LLM agents, as well as metrics and protocols that more accurately reflect the demands of real-world software maintenance (Chen et al., 4 Mar 2026).
1. Limitations of One-Shot Benchmarks and the SWE-CI Paradigm
Traditional software engineering agent benchmarks such as HumanEval, SWE-bench, and Terminal-bench assess performance only in snapshot-style, static settings: an LLM receives a full specification or issue and is tasked to produce a repair or feature in a single step. This approach captures short-term functional correctness but fails to model long-term software evolution. Real codebases evolve over months or years, incorporating frequent interface modifications, complex feature additions, and repeated mitigations of regressions across many CI cycles. One-shot benchmarks do not account for the incremental design decisions and cumulative technical debt that accrue during extended maintenance.
SWE-CI shifts this paradigm by simulating the entire CI loop over authentic commit histories. Rather than providing a fixed problem statement, tasks require agents to iteratively diagnose failing tests, author requirement increments, and implement code changes across multiple rounds. Critically, earlier implementation choices by the agent influence constraints and possible solutions in subsequent rounds, anchoring evaluation in long-term maintainability rather than localized correctness.
2. Task Construction and Benchmark Dataset
SWE-CI is constructed from real-world, open-source Python repositories exhibiting substantial codebase evolution and active development. Task selection follows a multi-stage process:
- Repository Selection: All public Python GitHub repositories with at least three years of activity, 500+ stars, a test-driven development suite, and a permissive license are considered. This yields 4,923 candidate repositories.
- Commit Span Extraction: The main branch is linearized and maximal consecutive commit spans are identified where dependencies are stable. Candidate tasks must comprise at least 1,000 lines of code modified (tests excluded), producing 8,311 initial spans.
- Environment Construction: For each span, Dockerized environments are auto-generated, dependencies are self-repaired, and test suites are executed to ensure reproducibility, filtering to 1,458 viable cases.
- Case Filtering: Final benchmarks are filtered so that the base state passes all existing tests and the oracle state adds at least five new passing tests. The top 100 tasks are selected by evolution time span and commit count. The staged filters are sketched below.
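To make the staged filtering concrete, the sketch below expresses the selection criteria as simple predicates, with thresholds taken from the stages above. The class and function names are illustrative assumptions, not the benchmark's published implementation.

```python
from dataclasses import dataclass

@dataclass
class Repo:
    language: str
    age_years: float
    stars: int
    has_test_suite: bool
    license_is_permissive: bool

@dataclass
class Span:
    commits: list[str]
    modified_loc: int  # lines changed across the span, tests excluded

def eligible_repo(r: Repo) -> bool:
    """Stage 1: repository selection (4,923 candidates pass this filter)."""
    return (r.language == "Python" and r.age_years >= 3 and r.stars >= 500
            and r.has_test_suite and r.license_is_permissive)

def eligible_span(s: Span) -> bool:
    """Stage 2: commit-span criterion, at least 1,000 modified LOC excluding tests."""
    return s.modified_loc >= 1000

def eligible_case(base_passes_all_tests: bool, new_oracle_tests: int) -> bool:
    """Stage 4: base state fully green and oracle adds >= 5 new passing tests."""
    return base_passes_all_tests and new_oracle_tests >= 5
```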
Benchmark statistics:
- 100 tasks from 68 repositories
- Average task: 233 days of evolution and 71 consecutive commits, with ≥500 modified LOC (tests excluded)
- Each task provides the full source tree, tests, and an environment snapshot
Each task constitutes a commit sequence $(C_0, C_1, \dots, C_T)$, with $C_0$ as the base commit (start) and $C_T$ as the oracle (target). At each iteration $t$, an Architect agent generates a requirement $r_t$ based on test failures; a Programmer agent applies the requirement, updating the codebase.
3. Continuous Integration Loop and Agent Protocol
The evaluation protocol formalizes the agent-driven CI loop:
- At round $t$, the state is the codebase $S_t$, with $S_0$ the base commit $C_0$.
- Oracle functions:
- $\mathcal{A}(S_t) \to r_t$ — generates the incremental requirement based on the delta to the oracle.
- $\mathcal{P}(S_t, r_t) \to S_{t+1}$ — applies the requirement to update the codebase.
- Dual-agent architecture:
- The Architect summarizes test failures, identifies root causes, and outlines 1–5 high-level requirements (output as XML).
- The Programmer parses the requirements, plans code edits under /app/code/, implements the changes, and submits code.
- Test results are externally evaluated and fed back to the Architect, closing the loop.
The process iterates for up to 20 rounds, terminating early once every test that passes at the oracle commit also passes in the current state. A minimal sketch of the loop follows.
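The sketch below assumes the Architect's XML wraps each requirement in a `<requirement>` tag and that `architect`, `programmer`, and `run_tests` stand in for the two LLM roles and the external test harness; all names are illustrative assumptions, not the benchmark's actual API.

```python
import xml.etree.ElementTree as ET

MAX_ROUNDS = 20

def parse_requirements(xml_text: str) -> list[str]:
    """Extract the Architect's 1-5 high-level requirements from its XML output."""
    root = ET.fromstring(xml_text)
    return [r.text.strip() for r in root.iter("requirement") if r.text]

def ci_loop(state, oracle_tests, architect, programmer, run_tests):
    """Drive the Architect -> Programmer -> test loop for one task.

    state        -- current codebase snapshot (S_t), initially the base commit
    oracle_tests -- ids of tests that pass at the oracle commit C_T
    architect    -- LLM call: (state, failures) -> XML requirement list
    programmer   -- LLM call: (state, requirements) -> updated codebase
    run_tests    -- external harness: state -> set of passing test ids
    """
    for round_no in range(1, MAX_ROUNDS + 1):
        passing = run_tests(state)
        if oracle_tests <= passing:  # oracle parity reached: terminate early
            return state, round_no, True
        failures = oracle_tests - passing
        requirements = parse_requirements(architect(state, failures))
        state = programmer(state, requirements)  # produces S_{t+1}
    return state, MAX_ROUNDS, False
```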
4. Metrics for Long-Term Maintainability
SWE-CI provides multiple granular metrics for evolution-aware evaluation:
- Normalized Change $\Delta_t$: For $P(S)$ denoting the set of oracle tests passed by codebase $S$, let $g_t = |P(S_t) \setminus P(S_{t-1})|$ (newly passing tests) and $\ell_t = |P(S_{t-1}) \setminus P(S_t)|$ (newly failing tests); then

$$\Delta_t = \frac{g_t}{|P(C_T) \setminus P(S_{t-1})|} - \frac{\ell_t}{|P(S_{t-1})|},$$

scaling improvements to $[0, 1]$ and regressions to $[-1, 0]$.
- EvoScore: A future-weighted mean of $\Delta_t$ across rounds,

$$\mathrm{EvoScore}_\gamma = \frac{\sum_{t=1}^{T} \gamma^{t}\,\Delta_t}{\sum_{t=1}^{T} \gamma^{t}}.$$

Higher $\gamma$ emphasizes later rounds, aligning the metric with long-term maintainability.
- Zero-Regression Rate: Fraction of tasks where no previously passing test fails throughout all iterations.
- Cumulative Success Rate by Round: At round $t$, the fraction of tasks where $S_t$ passes 100% of the oracle tests.
These metrics capture not only endpoint correctness but also the stability and robustness of agents’ iterative design decisions.
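These metrics can be computed directly from per-round sets of passing tests. The sketch below follows the reconstructed definitions above; the exact normalization of $\Delta_t$ and the $\gamma^t$ weight schedule are assumptions, chosen to match the stated ranges and the property that higher $\gamma$ emphasizes later rounds.

```python
def normalized_change(prev: set, curr: set, oracle: set) -> float:
    """Delta_t: improvements scaled to [0, 1], regressions to [-1, 0]."""
    gained = len(curr - prev)       # newly passing tests
    lost = len(prev - curr)         # newly failing tests
    room = len(oracle - prev) or 1  # failures still fixable
    base = len(prev) or 1           # tests that could regress
    return gained / room - lost / base

def evo_score(deltas: list[float], gamma: float) -> float:
    """Future-weighted mean of Delta_t; larger gamma shifts weight to later rounds."""
    weights = [gamma ** t for t in range(1, len(deltas) + 1)]
    return sum(w * d for w, d in zip(weights, deltas)) / sum(weights)

def zero_regression(history: list[set]) -> bool:
    """True iff no previously passing test ever fails across the run."""
    return all(prev <= curr for prev, curr in zip(history, history[1:]))

# Worked example: a 10-test oracle, passing sets after rounds 0-2.
oracle = {f"t{i}" for i in range(10)}
history = [{"t0", "t1"}, {"t0", "t1", "t2", "t3"}, {"t0", "t2", "t3", "t4"}]
deltas = [normalized_change(p, c, oracle) for p, c in zip(history, history[1:])]
print(evo_score(deltas, gamma=0.9))  # positive: net progress despite losing t1
print(zero_regression(history))      # False: t1 regressed in round 2
```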
5. Experimental Results and Model Comparison
The benchmark evaluates 18 LLM-powered agent variants from eight providers, orchestrated via the iFlow CLI (up to 20 rounds, with both Architect and Programmer roles sharing the same LLM). Test execution leverages pytest and pytest-json-report, with a 3,600-second timeout per run.
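A sketch of one evaluation run, using pytest-json-report's real command-line flags (`--json-report`, `--json-report-file`) and the stated 3,600-second timeout; the working directory, quiet flag, and timeout handling are assumptions:

```python
import json
import subprocess

def run_tests(repo_dir: str, report_path: str = "/tmp/report.json") -> set[str]:
    """Execute the suite and return the ids of passing tests."""
    try:
        subprocess.run(
            ["pytest", "--json-report", f"--json-report-file={report_path}", "-q"],
            cwd=repo_dir,
            timeout=3600,        # per-run limit used in the evaluation
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return set()             # treat a timed-out run as zero passing tests
    with open(report_path) as f:
        report = json.load(f)
    return {t["nodeid"] for t in report.get("tests", []) if t["outcome"] == "passed"}
```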
Key findings:
- Newer models in each lineage consistently outperform predecessors; Claude Opus leads in both absolute and maintainability metrics.
- Provider-wise, MiniMax, DeepSeek, and GPT families demonstrate better long-term maintainability when weighted with higher $\gamma$; GLM and Kimi peak at lower $\gamma$, indicating strength in short-term tasks. Qwen, Doubao, and Claude variants show stable performance across $\gamma$ values.
- Zero-regression remains challenging: most models have rates below 0.25; only two Claude Opus versions exceed 0.50.
- Top models (Claude Opus v2, GLM-5) exhibit EvoScores of approximately 0.56 and 0.48 (at the same $\gamma$), zero-regression rates of approximately 0.52 and 0.24, and full success rates after 20 rounds of 38% and 22%, respectively.
6. Failure Modes and Diagnostic Analyses
Several salient failure modes are documented:
- Interface Drift: Agents may hard-wire function signatures prematurely, causing incompatibilities with future requirements or tests (illustrated after this list).
- Ad hoc Patching and Overfitting: Early-round solutions may bypass underlying architectural problems, accumulating technical debt and leading to subsequent regressions.
- Complex Feature Planning: Agents struggle with multi-step refactorings and architectural changes required for deep design updates (e.g., evolving class hierarchies).
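A hypothetical illustration of interface drift (not drawn from the paper): an early hard-wired signature satisfies the current round's tests but cannot absorb a later requirement without breaking callers.

```python
import sys
from typing import IO

def render_report() -> str:
    return "report body"

# Round 1: the agent hard-wires a file-path signature for the current tests.
def export_report(path: str) -> None:
    with open(path, "w") as f:
        f.write(render_report())

# Round 4 requirement: reports must also stream to stdout or a pipe.
# The path-only signature forces every caller to change; a file-like
# parameter from the start would have absorbed the new requirement.
def export_report(dest: IO[str] = sys.stdout) -> None:  # redefines the above
    dest.write(render_report())
```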
The diagnostic analysis indicates that while LLMs are strong in one-shot bug-fix scenarios, they lack robust architectural modeling, resulting in “myopic” updates. Agents that remain strong under future-weighted evaluation (higher $\gamma$) tend to produce cleaner abstractions, but cross-module invariants remain difficult to maintain. The Architect–Programmer split improves modular decision-making, albeit with ongoing sensitivity to prompt engineering and context limitations. This suggests that further advances will require explicit architectural reasoning and extended context handling.
7. Conclusions and Future Directions
SWE-CI demonstrates a persistent gap between pass/fail snapshot benchmarks and the nuanced demands of sustained code maintainability. Even the most capable evaluated agents achieve only around 40% full success after 20 rounds and are prone to frequent regressions.
Fine-grained evolution metrics such as EvoScore and zero-regression rate enable diagnostic clarity that static benchmarks lack. Key future directions include leveraging commit history retrieval, integrating project-wide context for better planning, adding explicit architectural reasoning modules (e.g., abstract syntax tree-based planners), broadening the CI-loop to new agent types (e.g., automated code review, static analysis), and expanding to multi-language, polyglot codebases.
SWE-CI thus establishes the first rigorous, repository-level evaluation standard for long-term software agent maintainability, providing a foundation for the development and assessment of future code intelligence systems (Chen et al., 4 Mar 2026).