- The paper introduces SWE-CHAIN, a benchmark evaluating coding agents with chained, release-level upgrades using real release notes and code diffs.
- The methodology uses the DecompSynth pipeline to generate structured upgrade requirements and evaluates agents by resolving rate, precision, and F1.
- Results reveal that top agents still struggle with cumulative integration, emphasizing the need for improved state tracking and error recovery in long-horizon maintenance.
SWE-CHAIN: Systematic Benchmarking of Coding Agents in Chained Release-Level Package Maintenance
Motivation and Benchmark Design
SWE-CHAIN addresses a gap in the evaluation of autonomous coding agents under realistic, long-horizon software engineering workflows. Whereas previous benchmarks focus primarily on isolated issues or synthetic maintenance checkpoints, SWE-CHAIN centers on chained release-level package upgrades, mirroring real-world maintenance, which proceeds from one officially shipped version to the next. Each benchmark instance is a sequence of version-to-version upgrades in a major Python package. Each transition is defined by maintainer-authored release notes and concrete code diffs, and each step starts from the agent's own output at the previous step rather than from the reference code.
To mitigate the noisy and underspecified requirements prevalent in prior datasets (e.g., raw concatenation of issues, PRs, or terminal outputs as in SWE-EVO), SWE-CHAIN introduces the DecompSynth pipeline. This configurable, divide-and-conquer agentic synthesis system aligns semantically structured specifications directly with concrete code and test diffs, producing upgrade requirements that are informative, tightly grounded in real changes, and designed for robust agent consumption. The pipeline comprises structured task extraction, multi-label hunk-to-requirement matching, rigorous filtering, and multi-level granularity control. Manual quality review normalizes the outputs and excludes irrelevant transitions such as yanked versions or changelog-only releases.
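The grounding idea behind multi-label hunk-to-requirement matching can be illustrated with a minimal sketch. It is not the DecompSynth implementation: the function name `ground_requirements`, the identifier formats, and the filtering rule (drop any requirement that no diff hunk supports) are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def ground_requirements(
    requirement_ids: Iterable[str],
    hunk_labels: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Invert multi-label hunk annotations and keep only grounded requirements.

    hunk_labels maps a hunk id to the set of requirement ids it supports;
    a single hunk may carry several labels (multi-label matching).
    """
    known = set(requirement_ids)
    grounded: Dict[str, List[str]] = defaultdict(list)
    for hunk_id, labels in sorted(hunk_labels.items()):
        # Labels that point at unknown requirements are ignored.
        for req_id in sorted(labels & known):
            grounded[req_id].append(hunk_id)
    # Assumed filter: a requirement with no supporting hunk never enters
    # `grounded`, so it is dropped from the result.
    return dict(grounded)
```

In this sketch, a requirement that is mentioned in the release notes but has no matching code or test hunk simply disappears from the output, which is one plausible way to keep requirements tied to real changes.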
Benchmark Scale, Specification, and Protocol
SWE-CHAIN covers 12 upgrade chains across 9 mature Python packages, spanning 155 consecutive version transitions and 1,660 synthesized, code-change-grounded upgrade requirements. Chains capture substantial diversity in codebase size, release granularity, and task density—ranging from compact security libraries (PyJWT) to large-scale scientific computation frameworks (xarray). Each chain is fully containerized, with infrastructure to pin interpreters and dependencies across all intermediate versions, guaranteeing reproducibility, isolation, and consistent error handling.
Upgrade chain progression is agent-centric: the agent begins from the base codebase and must carry its own modifications forward at each release, with the opportunity for a single patch attempt in response to execution errors (the Build+Fix regime), reflecting real automated development constraints. Extensive controls prevent information leakage and tool misuse, including sandboxing, enforced tool blacklists, and explicit prompt-level prohibition of fetching external code or packages.
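The chained protocol with a single repair attempt can be sketched as a small driver loop. This is a minimal illustration under stated assumptions, not the benchmark harness: `run_chain`, `agent_step`, and `build` are hypothetical names, and the real system runs each step inside a container rather than calling plain Python functions.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

S = TypeVar("S")  # whatever represents the codebase state


def run_chain(
    base: S,
    requirements: List[str],
    agent_step: Callable[[S, str, Optional[str]], S],
    build: Callable[[S], Optional[str]],
) -> Tuple[S, List[bool]]:
    """Run an upgrade chain, carrying the agent's own state forward.

    agent_step(state, requirement, error) returns the new state.
    build(state) returns an error message, or None on success.
    """
    state = base
    build_ok: List[bool] = []
    for req in requirements:
        state = agent_step(state, req, None)
        error = build(state)
        if error is not None:
            # Build+Fix: exactly one repair attempt, given the error output.
            state = agent_step(state, req, error)
            error = build(state)
        build_ok.append(error is None)
        # The state moves on to the next release even if the build failed;
        # it is never reset to the reference code.
    return state, build_ok
```

The key design point is that `state` is never replaced with the reference implementation between releases, so mistakes made at one step remain in the codebase for every later step.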
SWE-CHAIN evaluation employs three main metrics at each upgrade step, micro-averaged across chains:
- Resolving Rate: Fraction of upgrade-related test behaviors implemented correctly (analogous to recall in test-level classification).
- Precision: Preservation of correctness; penalizes regressions on previously passing behaviors.
- F1-score: Harmonic mean of resolving rate and precision, reflecting balanced performance.
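A per-step computation of the three metrics might look like the sketch below. The summary does not give the exact formulas, so the definitions here are assumptions: resolving rate is taken as resolved target behaviors over all target behaviors, and precision as resolved target behaviors over resolved behaviors plus regressions. `StepOutcome` and `step_metrics` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StepOutcome:
    """Test outcomes for one upgrade step (hypothetical structure)."""
    target_total: int     # upgrade-related behaviors the step should deliver
    target_resolved: int  # of those, how many pass after the agent's changes
    regressions: int      # previously passing behaviors that now fail


def step_metrics(o: StepOutcome) -> Tuple[float, float, float]:
    """Return (resolving_rate, precision, f1) under the assumed definitions."""
    resolving = o.target_resolved / o.target_total if o.target_total else 0.0
    # Assumed precision: regressions play the role of false positives.
    changed = o.target_resolved + o.regressions
    precision = o.target_resolved / changed if changed else 0.0
    total = resolving + precision
    f1 = 2 * resolving * precision / total if total else 0.0
    return resolving, precision, f1
```

For example, a step with 10 target behaviors, 6 of them resolved, and 2 regressions gives a resolving rate of 0.6 and a precision of 0.75 under these definitions. How per-step values are aggregated across steps and chains is left out of the sketch.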
Empirical results demonstrate nontrivial difficulty: across nine state-of-the-art coding agent configurations (including OpenAI Codex, Anthropic Claude Code, MiniMax, GLM), the average Build+Fix resolving rate is 44.8%, precision is 65.4%, and F1 is 50.2%. The top configuration, Claude-Opus-4.7 (Claude Code CLI), achieves a 60.8% resolving rate, 80.6% precision, and 68.5% F1, well above the average but still far from perfect. These outcomes indicate that current frontier LLM-based agents remain brittle in chained, cumulative package upgrade scenarios. Notably, the Build+Fix regime primarily improves precision: on the second pass, agents tend to repair regressions they introduced rather than deliver significantly more of the required functionality.
Performance varies sharply by chain; for example, easier targets like PyJWT and Jinja2 exhibit resolving rates near 70% for leading agents, while complex or large-scale upgrades (e.g., conan, xarray) remain below 30%. This difficulty gradient is attributable both to absolute codebase scale and to per-release code-diff density.
Specification Granularity, Resource Usage, and Efficiency
A controlled study evaluates the effect of specification granularity. Five specification variants are instantiated, ranging from noisy raw release notes and PRs (L1, reminiscent of SWE-EVO) to detailed, oracle-style requirements with fully grounded behaviors and acceptance criteria (L5). The critical finding is that conceptual expectations and constraints (i.e., clear, user-facing requirements with behavioral boundaries) make agents markedly safer and more reliable than raw artifacts do. Higher granularity (L4/L5) further closes the gap for top agents but may be more detailed than the specifications maintainers realistically provide. The default L3 variant in SWE-CHAIN is positioned as a strong trade-off that prevents overfitting while remaining actionable.
Efficiency measurements show that top-performing agents incur substantial computational cost (e.g., $150.39/chain for Claude-Opus-4.7) and high token and tool-call usage, but higher cost does not consistently translate into better downstream performance.
Implications, Limitations, and Future Research Outlook
SWE-CHAIN reveals several core limitations of current LLM-based coding agents for release-level package maintenance:
- Cumulative Integration Failure: Agents often produce upgrades that break existing functionality or fail to implement all behavioral requirements when tasked to carry forward their own work over long chains.
- Lack of Dominance Across Chains: Performance differences between models are package- and transition-specific. No agent dominates everywhere, which suggests that results are entangled with codebase structure and specification style.
- Challenges With Granular Guidance: Raw changelogs and unstructured PRs are insufficient for reliable maintenance; explicit behavioral expectations and conceptual boundaries are essential.
SWE-CHAIN’s discriminative difficulty and chain-specificity provide a fine-grained signal for comparing and improving agents. Immediate implications include the need for:
- Next-generation agent architectures with stronger cross-version state tracking, error recovery, and context management.
- More effective and data-efficient usage of release metadata for behavioral requirement induction.
- Scaling beyond Python and inclusion of ecosystem-diverse repositories and build systems.
Theoretically, SWE-CHAIN pushes towards establishing benchmarks where sequential agent outputs are subject to compounding error and functionality drift, shifting the evaluation paradigm towards true software maintenance, rather than isolated one-shot patching.
Conclusion
SWE-CHAIN is a high-fidelity benchmark for evaluating LLM-powered coding agents in chained, release-bounded, real-world software maintenance. The benchmark directly targets the longitudinal reliability and evolutionary maintainability of agent-produced code, exposing substantial limitations in current-generation models. SWE-CHAIN provides not only a framework for agent comparison but also experimental probes for research in long-horizon autonomy, requirement induction, and integrated agentic workflows. It can serve both as a benchmark for next-generation agent systems and as a guide for the design of practical AI-powered developer tools.