SWE-CI: CI Benchmark for Code Evolution
- SWE-CI is a benchmark that evaluates LLM-powered agents through multi-turn CI loops, emphasizing long-term maintainability over snapshot correctness.
- It operationalizes real-world repository evolution with tasks spanning an average of 233 days and 71 commits, with substantial code changes.
- Performance is measured via EvoScore, normalized change, and zero-regression rate, offering actionable insights into agents’ ability to manage technical debt.
SWE-CI is a repository-level benchmark designed to evaluate the capability of LLM-powered agents to maintain and evolve codebases via a continuous integration (CI) loop, with a focus on long-term maintainability rather than snapshot functional correctness. Built to address fundamental limitations of traditional one-shot, static code benchmarks, SWE-CI operationalizes a realistic multi-turn workflow of repair, refactoring, and feature iteration over real-world repositories. Each task requires dozens of rounds of agentic analysis and code modification, tracing software evolution that spans months and dozens of commits. It is presented as the first benchmark to systematically quantify sustained code quality and technical debt management by autonomous agents (Chen et al., 4 Mar 2026).
1. Formalization of the SWE-CI Evaluation Loop
SWE-CI defines an agentic evolution protocol over codebases $c \in \mathcal{C}$, requirements $r \in \mathcal{R}$, and a unit test suite $T$. Two oracles underpin the loop:
- $\mathrm{require}$: Given the current codebase $c$ and a golden/reference codebase $c^*$, outputs a requirement $r$ describing the test-based delta.
- $\mathrm{code}$: Applies requirement $r$ to $c$, yielding a new candidate $c'$.
Unlike static snapshot benchmarks, which process $(c, r) \mapsto c'$ in a single step, SWE-CI instantiates a CI loop of length $N$:
$$r_i = \mathrm{require}(c_i, c^*), \qquad c_{i+1} = \mathrm{code}(r_i, c_i)$$
for $i = 0, \dots, N-1$, halting when $c_i \equiv c^*$ (modulo tests) or upon exhausting the iteration budget. Each benchmark task $(c_0, c^*)$ corresponds to a commit span $c_0 \to c^*$ in a real repository, with the agent incrementally "closing the gap" against moving targets as defined by the oracle functions (Chen et al., 4 Mar 2026).
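The loop described above can be illustrated with a toy sketch. Everything here is a simplifying assumption rather than the benchmark's implementation: a codebase is modeled as a hypothetical mapping from test name to pass/fail, `require` lists the tests that pass on the reference but not on the current codebase, and `code` stands in for the agent by fixing at most `per_round` of those tests per iteration. The default budget of 20 mirrors the iteration cap reported in the experiments.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for a codebase: test name -> whether that test passes.
Codebase = Dict[str, bool]


def require(current: Codebase, reference: Codebase) -> Set[str]:
    """Toy requirement oracle: tests passing on the reference but not on current."""
    return {t for t, ok in reference.items() if ok and not current.get(t, False)}


def code(requirement: Set[str], current: Codebase, per_round: int = 2) -> Codebase:
    """Toy agent step: make at most `per_round` of the required tests pass."""
    updated = dict(current)
    for test_name in sorted(requirement)[:per_round]:
        updated[test_name] = True
    return updated


def ci_loop(base: Codebase, reference: Codebase, max_iters: int = 20) -> List[Codebase]:
    """Run the require -> code loop; stop when the gap is closed or the budget runs out."""
    history = [base]
    current = base
    for _ in range(max_iters):
        requirement = require(current, reference)
        if not requirement:  # no test-based delta left
            break
        current = code(requirement, current)
        history.append(current)
    return history


if __name__ == "__main__":
    base = {f"t{i}": False for i in range(5)}
    reference = {f"t{i}": True for i in range(5)}
    print(len(ci_loop(base, reference)) - 1, "iterations used")
```

A real agent can also break tests that previously passed, which this toy `code` function never does; that is exactly the behavior the regression-oriented metrics are meant to expose.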
2. Benchmark Construction and Dataset Characteristics
SWE-CI comprises 100 tasks drawn from 68 distinct Python repositories. Selection criteria ensure:
- Mean evolution span: 233 days per task
- Mean number of intermediate commits: 71
- Minimum 1,000 lines of code (LOC) changed per task (excluding test files)
- Each repository: active maintenance for at least 3 years, at least 500 GitHub stars, explicit lockfiles, a nonrestrictive license, and comprehensive unit tests
The pipeline is as follows:
- Repository Collection: Roughly 150,000 public projects filtered to 4,923 by activity, stars, license, and tests.
- Commit Span Extraction: Linearize histories, segment spans without dependency changes, and select those with a diff of at least 1,000 LOC (yielding 8,311 endpoint pairs).
- Environment Construction: Automated Dockerization and test re-execution; missing dependencies are injected to recover environment reproducibility (yielding 1,458 viable pairs).
- Filtering & Selection: Pairs are filtered for launchable tests and minimum test delta; tasks are then ranked by time span and commit count, with the top 100 chosen for the release set (Chen et al., 4 Mar 2026).
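The final filtering and ranking stage can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the `CandidatePair` record, its field names, the `min_test_delta` threshold, and the choice to rank by span first and commit count second are all assumptions made for the example; the source states only that pairs are filtered for launchable tests and a minimum test delta, then ranked by time span and commit count.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidatePair:
    """Hypothetical record for one (base commit, target commit) endpoint pair."""
    repo: str
    span_days: int      # calendar days between the two endpoints
    n_commits: int      # intermediate commits in the span
    loc_changed: int    # non-test lines changed
    tests_launch: bool  # does the test suite start in the rebuilt environment?
    test_delta: int     # tests whose outcome differs between the endpoints


def select_tasks(
    pairs: List[CandidatePair],
    top_k: int = 100,
    min_loc: int = 1000,
    min_test_delta: int = 1,  # assumed threshold; the source gives no value
) -> List[CandidatePair]:
    """Drop non-viable pairs, then keep the longest, most commit-dense spans."""
    viable = [
        p for p in pairs
        if p.tests_launch and p.loc_changed >= min_loc and p.test_delta >= min_test_delta
    ]
    # Assumed tie-breaking: longer time span first, then more commits.
    ranked = sorted(viable, key=lambda p: (p.span_days, p.n_commits), reverse=True)
    return ranked[:top_k]
```

Sorting on a tuple keeps the ranking rule explicit and easy to swap out, for instance for a weighted combination of span and commit count.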
3. Evaluation Metrics for Maintainability and Evolution
SWE-CI measures an agent’s long-term software engineering competence via three quantitative metrics:
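- Normalized Change $a$: Measures test progress or regression per iteration. For $n(c) = \sum_{t \in T} \mathbb{1}[\text{test } t \text{ passes on } c]$, with $c_0$ the base codebase and $c^*$ the reference codebase:
$$a(c) = \begin{cases} \dfrac{n(c) - n(c_0)}{n(c^*) - n(c_0)} & \text{if } n(c) \geq n(c_0) \\[2ex] \dfrac{n(c) - n(c_0)}{n(c_0)} & \text{otherwise} \end{cases}$$
$a(c) \in [-1, 1]$, yielding $a > 0$ for progress, $a < 0$ for regression.
- EvoScore $e$: A future-weighted mean of $a$ across CI iterations; for discount parameter $\gamma$:
$$e = \frac{\sum_{i=1}^{N} \gamma^{i}\, a(c_i)}{\sum_{i=1}^{N} \gamma^{i}}$$
$\gamma > 1$ biases toward later iterations, rewarding sustainable fixes.
- Zero-Regression Rate: The fraction of tasks in which no regression occurs at any iteration; it quantifies an agent's ability to avoid regressions over long codebase evolution (Chen et al., 4 Mar 2026).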
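The three metrics can be computed with a few lines of code. The sketch below rests on stated assumptions rather than the paper's reference implementation: normalized change is taken as the fraction of the test gap closed (or, on regression, the fraction of originally passing tests lost), EvoScore as a mean weighted by `gamma ** i` over iterations `i = 1..N`, and a task counts as regression-free when its normalized change never drops below zero. All function names are hypothetical.

```python
from typing import List, Sequence


def normalized_change(n_cur: int, n_base: int, n_ref: int) -> float:
    """Normalized change from passing-test counts (current, base, reference)."""
    if n_cur >= n_base:
        gap = n_ref - n_base
        # Assumption: with no gap to close, report zero progress.
        return (n_cur - n_base) / gap if gap > 0 else 0.0
    # Regression branch: n_base > n_cur >= 0, so n_base is nonzero here.
    return (n_cur - n_base) / n_base


def evo_score(changes: Sequence[float], gamma: float = 1.0) -> float:
    """Mean of per-iteration normalized changes, weighted by gamma ** i."""
    if not changes:
        raise ValueError("need at least one iteration")
    weights = [gamma ** i for i in range(1, len(changes) + 1)]
    return sum(w * a for w, a in zip(weights, changes)) / sum(weights)


def zero_regression_rate(tasks: Sequence[Sequence[float]]) -> float:
    """Fraction of tasks whose normalized change is never negative (assumed criterion)."""
    if not tasks:
        raise ValueError("need at least one task")
    clean = sum(1 for changes in tasks if all(a >= 0 for a in changes))
    return clean / len(tasks)


if __name__ == "__main__":
    trajectory: List[float] = [normalized_change(n, 5, 15) for n in (7, 10, 15)]
    print(trajectory, evo_score(trajectory, gamma=2.0))
```

With `gamma=1.0` the score reduces to a plain average; raising `gamma` shifts weight toward the final iterations, so a late regression costs more than an early one.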
4. Comparison with Prior Static Code Benchmarks
SWE-CI introduces dimensions absent in prior static benchmarks such as HumanEval, MBPP, or SWE-bench:
| Paradigm | Data Granularity | Capability Measured |
|---|---|---|
| Static (e.g. SWE-bench) | One-shot, single requirement | Functional correctness at snapshot |
| SWE-CI | Multi-turn, full repo, multi-commit | Long-term maintainability, technical debt management |
SWE-CI tasks require sustained reasoning and long-horizon planning, with agents challenged to maintain or improve code quality iteratively. Unlike snapshot pass@1 metrics, SWE-CI explicitly evaluates whether models "stay ahead of tests" over realistic software lifecycles (Chen et al., 4 Mar 2026).
5. Experimental Results and Observations
Eighteen LLMs from eight providers were evaluated on SWE-CI using a dual-agent protocol (Architect + Programmer), with each task capped at 20 CI iterations. Key findings include:
- EvoScore Trends: Models released post-2026, especially the Claude Opus series and GLM-5, consistently outperform earlier architectures.
- $\gamma$-Sensitivity: Providers differ in how they trade off short- vs. long-term gains; GPT and MiniMax favor $\gamma > 1$ (late iterations), while GLM/Kimi favor $\gamma < 1$ (early iterations).
- Zero-Regression Rates: Most agents achieve zero regressions in fewer than 25% of tasks; only select Claude-Opus variants exceed 50%.
Despite overall progress, state-of-the-art LLMs still exhibit significant difficulties with regression avoidance and long-horizon maintainability planning (Chen et al., 4 Mar 2026).
6. Representative SWE-CI Workflow Example
A typical SWE-CI scenario involves:
- Base commit ($c_0$): 12 new failing tests, three regressions
- Iteration 1: Architect proposes implementing a missing utility function and patching error handling; Programmer integrates code; eight new tests pass
- Subsequent iterations: Architect identifies needed refactorings or test-driven repairs; Programmer implements them; the intermediate normalized change $a$ approaches 1 as the agent closes the test gap
- Final iteration: All oracle tests pass, $a = 1$, and an EvoScore with $\gamma > 1$ reflects the contribution of later, stabilization-focused changes (Chen et al., 4 Mar 2026).
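To make the arithmetic concrete, here is a worked example under simplifying assumptions that are not taken from the paper: the test gap is counted as the 12 failing tests only (the three regressions are ignored), the trajectory is compressed to two iterations (eight tests fixed, then all twelve), and EvoScore is read as a mean weighted by $\gamma^i$.

```latex
% Hypothetical two-iteration trajectory: 8 of 12 gap tests pass, then all 12.
\begin{align*}
a_1 &= \tfrac{8}{12} = \tfrac{2}{3}, \qquad a_2 = \tfrac{12}{12} = 1 \\[1ex]
\gamma = 1:\quad e &= \frac{1 \cdot \tfrac{2}{3} + 1 \cdot 1}{1 + 1} = \frac{5}{6} \approx 0.833 \\[1ex]
\gamma = 2:\quad e &= \frac{2 \cdot \tfrac{2}{3} + 4 \cdot 1}{2 + 4} = \frac{16/3}{6} = \frac{8}{9} \approx 0.889 \\[1ex]
\gamma = \tfrac{1}{2}:\quad e &= \frac{\tfrac{1}{2} \cdot \tfrac{2}{3} + \tfrac{1}{4} \cdot 1}{\tfrac{1}{2} + \tfrac{1}{4}} = \frac{7/12}{3/4} = \frac{7}{9} \approx 0.778
\end{align*}
```

Because the trajectory improves over time, the score rises with $\gamma$: weighting later iterations more heavily rewards the agent for finishing in a fully passing state.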
7. Limitations, Challenges, and Future Directions
Identified challenges include:
- Reproducibility: Docker environments for legacy repositories may break due to native dependencies or OS drift.
- Language Scope: The current release is Python-only; extension to other ecosystems remains open.
- Test Coverage Variance: Varying test suite quality in source repositories introduces bias, risking underestimation of regression rates.
- Agent Protocol Complexity: The dual-agent (Architect/Programmer) setup captures realistic workflows but increases system complexity.
Future extensions are anticipated to address:
- Multi-language CI benchmarking (Java, JS, C++)
- Incorporation of richer CI events (linting, type-checking, security analysis)
- Human-in-the-loop evaluation (code reviews)
- Agents trained via reinforcement learning directly optimized for EvoScore (Chen et al., 4 Mar 2026).
8. SWE-CI and the Emergence of Repository-Centric Evaluation
The inadequacy of Task-Centric Learning (TCL) for small models has led to the development of Repository-Centric Learning (RCL) paradigms, which prioritize deep vertical mastery of a single codebase over broad task distribution. Work with models such as SWE-Spot-4B demonstrates that RCL fine-tuning results in agents with superior inference efficiency and sample efficiency within evolving codebases, making them especially apt for resource-constrained and privacy-sensitive settings typical of real-world CI/CD pipelines. For SWE-CI-style tasks, RCL enables small agents to maintain performance without the extensive online search that large frontier models require, and incremental RCL offers a tractable path for adaptive repo-expert deployment as codebases change (Peng et al., 29 Jan 2026).
References:
- (Chen et al., 4 Mar 2026) "SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration"
- (Peng et al., 29 Jan 2026) "SWE-Spot: Building Small Repo-Experts with Repository-Centric Learning"