
SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

Published 25 Mar 2026 in cs.SE, cs.AI, and cs.CL | (2603.24755v1)

Abstract: Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite but become progressively harder to extend. Recent iterative benchmarks attempt to close this gap, but constrain the agent's design decisions too tightly to faithfully measure how code quality shapes future extensions. We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without prescribing internal structure. We track two trajectory-level quality signals: verbosity, the fraction of redundant or duplicated code, and structural erosion, the share of complexity mass concentrated in high-complexity functions. No agent solves any problem end-to-end across 11 models; the highest checkpoint solve rate is 17.2%. Quality degrades steadily: erosion rises in 80% of trajectories and verbosity in 89.8%. Against 48 open-source Python repositories, agent code is 2.2x more verbose and markedly more eroded. Tracking 20 of those repositories over time shows that human code stays flat, while agent code deteriorates with each iteration. A prompt-intervention study shows that initial quality can be improved, but it does not halt degradation. These results demonstrate that pass-rate benchmarks systematically undermeasure extension robustness, and that current agents lack the design discipline iterative software development demands.

Summary

  • The paper introduces SlopCodeBench to evaluate iterative coding agents and reveals that no model achieves a complete end-to-end solution.
  • It employs novel metrics like verbosity and structural erosion to quantify the accumulation of technical debt during multi-turn code extensions.
  • Experimental results show that while prompt strategies lower initial slop, they fail to prevent long-horizon architectural degradation.

SlopCodeBench: Assessing Long-Horizon Degradation in Iterative Coding Agents

Motivation and Benchmark Design

The paper introduces "SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks" (2603.24755), a benchmark specifically devised to fill a pronounced methodological gap in the evaluation of code generation agents. Current agentic coding benchmarks focus predominantly on single-shot solutions, evaluating correctness against a complete specification but failing to capture the structural and design degradation that manifests when code is iteratively extended. This oversight ignores a central property of authentic software development: the accumulation of technical debt and architectural slop through incremental changes as requirements evolve—a dynamic rarely measurable via pass-rate-centric approaches.

SlopCodeBench (SCBench) targets this critical failure mode by tasking agents with repeatedly extending their own prior code across up to eight specification checkpoints per problem instance. Each problem is defined by observable behavioral contracts (CLI or API boundaries) and intentionally withholds the internal interface structure and explicit test suites, forcing the agent to commit to architectural choices whose implications are revealed only in subsequent iterations. The multi-turn iterative paradigm surfaces how early optimality can result in downstream complexity, redundancy, and code "slop."
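
The checkpoint loop described above can be sketched in miniature. This is an illustrative reconstruction, not the benchmark's actual harness: the class names, the `run_problem` function, and the toy agent are all invented here.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a SlopCodeBench-style checkpoint loop; the class
# and function names are invented here, not the benchmark's actual API.

@dataclass
class Checkpoint:
    spec: str           # new requirements revealed at this step
    hidden_tests: list  # test callables the agent never sees

@dataclass
class Trajectory:
    codebase: str = ""  # agent's cumulative solution
    results: list = field(default_factory=list)

def run_problem(agent, checkpoints) -> Trajectory:
    traj = Trajectory()
    for cp in checkpoints:
        # The agent extends its own prior code under the new spec; it sees
        # behavioral contracts (CLI/API boundaries), never the hidden tests.
        traj.codebase = agent(cp.spec, prior_code=traj.codebase)
        traj.results.append(all(t(traj.codebase) for t in cp.hidden_tests))
    return traj

# Toy "agent" that appends a stub per spec (stands in for an LLM harness).
toy_agent = lambda spec, prior_code: prior_code + f"# impl for: {spec}\n"
traj = run_problem(toy_agent, [Checkpoint("add search", [lambda c: "search" in c])])
```

The key structural point is that each turn receives the agent's own prior code, so early design choices compound across checkpoints.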

Trajectory-Level Quality Metrics

To detect forms of quality decay not obvious in test-passing rates, the benchmark operationalizes two codebase-level structural metrics:

  • Verbosity: Quantifies redundant and duplicated code through a composite score combining AST-Grep rule violations and clone detection, normalized by logical lines of code.
  • Structural Erosion: Measures the fraction of complexity mass (the product of cyclomatic complexity and √SLOC per function) concentrated in high-CC functions (CC > 10), highlighting how new logic tends to accrete into large, unmanageable callables during iterative agent extension.

These orthogonal axes are crucial because codebases can remain functionally correct while becoming progressively less maintainable and less extensible, a reality that defies single-turn evaluation.
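
The erosion signal can be made concrete with a short sketch. The mass formula (CC × √SLOC, threshold CC > 10) follows the paper's definition; the per-function `(cc, sloc)` inputs below are invented examples and would in practice come from a static analyzer such as radon.

```python
import math

# Structural erosion: share of total complexity mass concentrated in
# high-CC functions. Mass formula per the paper; inputs are invented.

def complexity_mass(cc: int, sloc: int) -> float:
    # Per-function mass: cyclomatic complexity x sqrt(SLOC).
    return cc * math.sqrt(sloc)

def structural_erosion(functions, cc_threshold: int = 10) -> float:
    total = sum(complexity_mass(cc, sloc) for cc, sloc in functions)
    high = sum(complexity_mass(cc, sloc) for cc, sloc in functions
               if cc > cc_threshold)
    return high / total if total else 0.0

# A codebase dominated by one "God function" (CC=25, 400 SLOC) scores high:
funcs = [(2, 10), (3, 20), (25, 400)]
print(round(structural_erosion(funcs), 3))  # -> 0.962
```

Because the score is a ratio of mass rather than a raw count, a codebase can grow considerably without its erosion rising, so long as new complexity stays spread across small functions.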

Experimental Results

The evaluation covers 20 language-agnostic programming problems, each featuring 3–8 checkpoints (93 checkpoints total), and includes 11 leading code-generation models such as Opus 4.6, several variants of GPT-5.x Codex, and Sonnet. The principal findings are as follows:

  • Zero Complete Solution Trajectories: No model solves any problem end-to-end. The best strict solve rate per checkpoint remains at 17.2% for Opus 4.6.
  • Monotonic Quality Decay: Verbosity increases in 89.8% and erosion in 80% of agent solution trajectories as checkpoints progress, independent of underlying correctness.

    Figure 2: Solve rates and cost growth over checkpoint progress; strict solve rates drop sharply by the final checkpoint, while mean cost per checkpoint triples.

  • Disjunction from Pass-Rate Metrics: Pass rates for core and functionality tests are relatively preserved, but error-handling and regression pass rates collapse as tasks progress (core tests are up to 13x easier to pass than the complete suite by the final checkpoint).
  • Rising Cost Without Correctness Gains: Despite a 2.9x increase in mean cost per checkpoint from the first to the last progress bin, no corresponding gain in correctness is observed.

    Figure 4: Erosion and verbosity increase monotonically for all evaluated models, indicating consistent accumulation of technical debt across iterative progress.

  • Human Repository Comparison: When contrasted with 48 mature open-source Python repositories, agent-generated code is on average 2.2x more verbose and exhibits substantially higher structural erosion. Critically, agent metrics on both axes grow consistently with each checkpoint, whereas human repositories plateau.

    Figure 6: Mean verbosity and structural erosion plotted for agent runs (rising) and human repository histories (flat), demonstrating divergence over development time.

Effects of Prompt Strategies

A prompt-side intervention study evaluated two quality-focused prompt strategies: "anti-slop" (explicitly forbidding verbose and overengineered patterns) and "plan-first" (requiring explicit architectural planning prior to code emission), in addition to a minimalist baseline.

  • Lowered Initial Verbosity/Erosion but No Slope Change: Both interventions reduce initial codebase verbosity (by 33–35%) and erosion (by up to 50%), but the rate of quality degradation per checkpoint is statistically indistinguishable from baseline.

    Figure 1: Prompt strategies shift the starting point of verbosity and erosion (lower intercepts) but do not alter the trend; trajectories remain parallel across all prompt types.

  • No Consistent Correctness Improvement: Substantial reductions in structural slop do not produce statistically significant improvements in any pass-rate subtype. In some cases, total monetary cost increases by up to 48% for only marginal code quality benefits.
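
The intercept-versus-slope distinction at the heart of this finding can be illustrated with a toy least-squares fit. The data points below are invented for illustration, not taken from the paper; the pattern they encode is the paper's finding that interventions lower the starting point but not the drift.

```python
# Fit metric = intercept + slope * checkpoint per condition with ordinary
# least squares. Data are illustrative, not the paper's measurements.

def linfit(xs, ys):
    # OLS for y = a + b*x, returning (intercept, slope).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

checkpoints = [0, 1, 2, 3, 4]
baseline = [0.30, 0.36, 0.42, 0.48, 0.54]   # verbosity per checkpoint
anti_slop = [0.20, 0.26, 0.32, 0.38, 0.44]  # lower start, same drift

a0, b0 = linfit(checkpoints, baseline)
a1, b1 = linfit(checkpoints, anti_slop)
print(f"baseline:  intercept={a0:.2f} slope={b0:.3f}")
print(f"anti-slop: intercept={a1:.2f} slope={b1:.3f}")
```

In this framing, a successful intervention would have to change the slope, not just the intercept; the paper reports that prompting alone does neither reliably for the slope.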

Discussion and Implications

These results contradict a foundational assumption implicit in existing pass-rate benchmarks: that syntactic and architectural quality are meaningfully tethered to correctness. In SCBench, an agent can consistently pass core tests while its code becomes dramatically less maintainable and harder to extend—a failure mode hidden from traditional single-shot or test-passing evaluation. The outcome is robust to prompt-side instruction following: neither prescriptive code style nor enforced planning can halt the acceleration of architectural slop once the agent is required to chain its own outputs.

Practically, this signals that current agent architectures, designed primarily for prompt completion rather than long-horizon planning, are ill-suited for iterative software engineering tasks. Techniques such as code health–aware action planning, explicit intermediate refactoring, or trajectory-level regularization (potentially at finetuning time) are necessary research directions. Furthermore, the fact that error-handling and regression test performance deteriorate most sharply exposes risk areas for safety-critical and evolving codebases under agentic maintenance.

From a theoretical perspective, the work formalizes a metric-based framework to quantify and track codebase decay dynamically, disentangled from correctness. This provides an empirical substrate for analyzing technical debt and architectural rot in agent-generated software, which will be vital for the development of long-term automated engineering agents.

Directions for Future Research

Key follow-up avenues from this work include:

  • Architectural Memory and Planning: Integrating explicit design reasoning or memory modules that allow agents to revisit and refactor architectural choices, thereby reducing the rate at which technical debt is acquired.
  • Training Objective Augmentation: Incorporation of regularization terms tied directly to codebase verbosity and structural erosion during pretraining or finetuning.
  • Continuous Evaluation Artifacts: Broader adoption of SlopCodeBench-like benchmarks, extended to other languages and real-world software lifecycles, to measure how interventions or new paradigms affect long-horizon codebase health.
  • Automated Repair and Refactoring Loops: Embedding periodic health checks and automated repair interventions into agent action loops to proactively address metric drift.

Conclusion

SlopCodeBench exposes a significant limitation of current code-generation agents: their inability to sustain architectural discipline across long-horizon, iterative software development. Structural decay and verbosity accumulate monotonically in agent-generated code, independently of superficial test-passing correctness, and prompt-based mitigations are insufficient. The findings underscore the necessity of fundamental advances in agent architecture, objective design, and benchmarking if coding agents are to be deployed safely and effectively in real-world, evolving codebases.

Explain it Like I'm 14

What is this paper about?

This paper introduces SlopCodeBench, a new way to test AI coding tools over time, not just once. The idea is simple: in real life, software changes again and again. When AI agents write code, it might work today but become a messy, hard-to-extend “slop” tomorrow. SlopCodeBench checks whether AI-written code stays clean and easy to build on as requirements keep changing.

What questions were the researchers asking?

  • Can today’s AI coding agents keep their code clean and easy to update when they must extend their own earlier work?
  • How does code quality change over many rounds of edits?
  • Do simple “pass the tests” scores miss important problems, like growing messiness?
  • How do AI agents compare to human programmers on code quality over time?
  • Can better prompts (like “plan first” or “avoid messy code”) prevent code from getting worse?

How did they study it?

Think of building a Lego project where the instructions keep changing. You start with a car, then you’re told to turn it into a truck, then add a crane, then make it fly. If your first design was rigid, every new feature becomes a struggle.

That’s what SlopCodeBench does for code:

  • It includes 20 problems split into 93 “checkpoints.” At each checkpoint, the AI gets a new requirement and must extend its own old code—not start fresh.
  • The tests are “black-box.” The AI sees the instructions and examples, but not the hidden tests. This prevents the AI from gaming the test suite and forces real design choices.
  • The tasks are language-agnostic (you could write the solution in different programming languages), though the paper focuses on Python for the experiments.
  • They evaluated 11 different AI coding agents using their official tools.

To measure “slop,” they tracked two simple ideas:

  • Verbosity: extra or repeated code that doesn’t add new value. Think of writing the same sentence three different ways—it’s longer, not better.
  • Structural erosion: cramming more and more decision-making into a few giant functions. Imagine one huge “God function” that tries to do everything—this makes future changes painful.

They also compared agent code to 48 real, maintained human Python projects and watched how those human projects changed over time. Finally, they tried prompt tricks like “plan first” or “avoid bloat” to see if that stops the slide into slop.

A quick example

One task starts as a simple search tool, then later asks for:

  • Multiple programming languages,
  • Smarter pattern matching using syntax trees,
  • Auto-fixes, and
  • Even more languages.

If the AI hardcodes early choices to handle just one language, it has to hack and patch later. If it designs a clean, flexible structure early on, adding features is smoother. SlopCodeBench rewards the latter—but the agents have to figure that out themselves.

What did they find?

  • No AI solved a full problem from start to finish across all checkpoints.
  • The best “strict” success rate at a checkpoint was only 17.2% (strict means passing all tests, including ones from earlier checkpoints).
  • Code quality steadily got worse as agents iterated:
    • Structural erosion went up in 80% of runs.
    • Verbosity went up in 89.8% of runs.
  • Compared to human projects, agent code was much messier:
    • Agent code was about 2.2× more verbose on average.
    • Agent code had far more “big, complicated” functions than human code.
  • Over time, human projects tended to stay steady, but agent code consistently got worse with each iteration.
  • Prompts that said “avoid slop” or “plan first” helped the very first version look cleaner, but they did not stop the steady decline later on.
  • These “quality-aware” prompts often cost more money and time—and still didn’t improve test pass rates.

Why is this important?

Most coding benchmarks today focus on whether the code passes tests right now. But in real software, you live with your code. You need to be able to add features, fix bugs, and grow the system. This paper shows that:

  • Test pass rates alone can hide serious long-term problems. Code can pass tests and still be turning into a tangled mess.
  • Current AI coding agents don’t reliably make future-friendly designs on their own. They tend to pile on patches, repeat code, and bloat a few giant functions as tasks evolve.
  • We need better ways to train and guide AI agents—tools, feedback, or training methods that teach them to keep code clean across many changes, not just solve the current step.

In short

SlopCodeBench is a new “long-term” test that shows how AI-written code gets messy when requirements change over time. It finds that today’s agents often create code that becomes harder to extend, even if it passes tests. Human projects stay more stable. Simple prompt fixes help at the start but don’t stop the decline. The takeaway: to build reliable AI coding assistants, we must measure and improve how well they handle long, evolving projects—not just single tasks.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, phrased to guide actionable future research.

  • External validity across languages is untested: although SlopCodeBench is designed to be language-agnostic, all experiments target only the Python track; it remains unknown whether the degradation dynamics and metrics (verbosity, erosion) hold for Java, C/C++, Rust, Go, etc., and whether tooling/metrics remain comparable across languages.
  • Benchmark scale and task diversity may be insufficient: 20 problems and 93 checkpoints might not cover the breadth of real-world architectural pressures (e.g., large multi-module repos, services, GUIs, build systems, package management), limiting generalizability.
  • Realism of “no tests visible” constraint is unvalidated: hiding test suites and prohibiting fail-to-pass feedback may not reflect common developer workflows where tests, CI, and linters are present; it is unknown how results change when agents can see tests or write/maintain their own tests.
  • Memoryless iteration between checkpoints may mask potential improvements: the benchmark resets agent session state and prior conversation context each turn; whether memoryful agents (with persistent plans, summaries, or design docs) slow degradation is not assessed.
  • Harness confounds are not disentangled from model ability: evaluating via heterogeneous, provider-specific CLI harnesses (with different versions and behaviors) risks conflating harness policy with model capability; a controlled ablation using a standardized loop is needed.
  • Sensitivity to harness versions is only lightly probed: the main results pick one version per model; a systematic sensitivity study across harness versions, settings, and toolchains is missing.
  • Validity of structural erosion as a maintainability proxy is unverified: the chosen mass formula (CC × sqrt(SLOC)) and CC>10 cutoff are heuristic; there is no user study or empirical validation linking this metric to developer effort, defect rates, or extension time.
  • Cross-language comparability of complexity metrics is unclear: cyclomatic complexity thresholds and mass aggregation may not transfer directly across languages and idioms; per-language calibration and thresholds are not evaluated.
  • Verbosity metric relies on 137 hand-authored AST-Grep rules without uncertainty estimates: coverage, false positives/negatives, and portability of these rules to other languages and coding styles are not measured; rule-set ablations and cross-language rule design remain open.
  • Clone detection details and robustness are under-specified: the clone detection tool, thresholds, and configuration are not fully detailed; sensitivity to clone type (Type-1/2/3/4), normalization choices, and repository size is unknown.
  • Other key quality dimensions are not measured: coupling, cohesion, modularity, dependency management, dead code, documentation quality, type hints, test coverage/quality, and API stability are omitted; the relationship between these dimensions and extension robustness is open.
  • Runtime and resource efficiency are untested: performance, memory use, latency, and scalability across checkpoints are not assessed; whether verbosity/erosion correlate with runtime regressions is unknown.
  • Human calibration is unmatched and potentially confounded: comparing agent outputs to unrelated open-source repos (stratified by stars) does not control for domain, size, maturity, or process; a matched human baseline solving the same SCBench tasks is needed.
  • Contamination risk is not formally audited: although specifications are synthetic, there is no formal contamination check against model pretraining data or harness exemplars; the impact of any overlap is unknown.
  • Error-handling is identified as a major failure mode but not targeted: error tests drive pass-rate declines, yet the paper does not evaluate targeted interventions (e.g., exception policies, fuzzing, negative-test planning) to improve error robustness over iterations.
  • Prompt-only interventions are limited; structural/tooling levers remain untested: the paper shows anti-slop and plan-first prompts reduce initial slop but not slope; effects of continuous code-quality feedback (e.g., gating on CC, clone ratio, linting/formatting, automated refactoring) are unstudied.
  • Multi-agent and human-in-the-loop strategies are not evaluated: planner–implementer–reviewer roles, code review cycles, or periodic human refactoring might arrest degradation; their effectiveness on slope reduction is unknown.
  • Refactoring allowances and rewrite strategies are not manipulated: the benchmark encourages incremental extension of prior code; studies that explicitly permit/require periodic refactors or modular rewrites to test whether slope decreases are missing.
  • Task-level analysis of degradation triggers is shallow: the paper cites a few exemplars (e.g., code, circuit_eval, dag_execution) but does not systematically identify task characteristics (e.g., cross-language expansion, complex dispatch, stateful APIs) that predict erosion/verbosity growth.
  • Patch dynamics are not quantified: the size, locality (append vs refactor), and distribution of changes across functions/modules over time are not analyzed; understanding when/where duplication and CC spikes are introduced is an open area.
  • Relationship between current quality metrics and future extensibility is uncalibrated: while verbosity/erosion increase, their predictive power for future checkpoint pass rates, time-to-implement, or defect introduction is not modeled.
  • Cost–quality tradeoffs are underexplored: analysis beyond average dollars/minutes per checkpoint (e.g., token budgets, “thinking” vs “acting” time, diminishing returns curves) is missing; it’s unclear which spending patterns, if any, slow degradation.
  • Effect of development tooling is unknown: formatting (e.g., black), linting (flake8/ruff), static typing (mypy/pyright), and test scaffolding may alter verbosity/erosion; these were not systematically varied.
  • Impact of conversation history and architectural artifacts is untested: providing agents with persistent designs, ADRs, diagrams, or code maps across checkpoints might change trajectories; this remains unexplored.
  • Security and defect trends are not measured: growth of vulnerabilities, unsafe patterns, or bug-introducing changes across iterations (and their correlation with verbosity/erosion) is unaddressed.
  • Long-horizon limits are modest: problems span 3–8 checkpoints; whether degradation accelerates, plateaus, or can be arrested beyond 10–20 turns remains unknown.
  • Benchmark reproducibility details need expansion: randomness control, OS/toolchain variability, and determinism of metrics/tests across machines are not fully detailed; reproducible runs across environments should be validated.
  • Generalizability to proprietary vs open models is uncertain: only 11 frontier configurations via Claude Code/Codex are tested; open-source models, smaller models, and fine-tuned agents are not assessed under identical conditions.
  • Alignment between correctness and quality is weak but unexplained: large reductions in verbosity/erosion did not improve pass rates; mechanisms (e.g., missing features, under-engineering side-effects) need analysis to design interventions that boost both.
  • Extensions to more realistic ecosystems are pending: tasks with external services, databases, networks, build/packaging, and multi-language repos are needed to test architectural discipline under richer constraints.
  • Metric release and cross-language tooling readiness need confirmation: for non-Python tracks, availability of parsers, CC analyzers, clone detectors, and AST-Grep equivalents (or alternatives) is unresolved.

Practical Applications

Immediate Applications

The following applications can be deployed now using SlopCodeBench’s artifacts (benchmark, metrics, harness setup) and the paper’s findings on degradation, verbosity, and structural erosion.

  • Industry — Procurement “bake-off” for coding agents
    • Sector(s): Software/DevTools, Enterprise IT
    • What to do: Evaluate candidate AI coding tools using SlopCodeBench before purchase. Compare strict/isolated/core pass rates, cost per checkpoint, and trajectory slopes for erosion and verbosity.
    • Tools/products/workflows: Internal evaluation harness; dashboard for per-model solve-rate vs. degradation; acceptance criteria (e.g., maximum erosion/verbosity slope).
    • Assumptions/dependencies: Access to vendor harnesses; compute/time budget to run the benchmark; internal thresholds calibrated to team norms; acknowledgment that results are Python-track unless extended.
  • Industry — CI/CD “slop gates” to block maintainability regressions
    • Sector(s): Software/DevTools, FinTech, HealthTech (regulated environments)
    • What to do: Add automated checks for erosion and verbosity on each PR (especially for AI-generated changes). Fail builds if metrics exceed a budget or slope threshold.
    • Tools/products/workflows: CI plugin (“SlopGuard”) integrating cyclomatic complexity (e.g., Radon-like), clone detection, and AST-Grep rules; PR annotations with diffs of quality metrics; “slop budgets” per service or repo.
    • Assumptions/dependencies: Reliable parsing for your language(s); baseline established per codebase to avoid false positives; developer training to interpret metrics; risk of metric gaming mitigated by periodic audits.
  • Industry — Code review assistants that flag and fix “slop”
    • Sector(s): Software/DevTools
    • What to do: Deploy a bot that comments on high-mass functions and verbosity patterns (copy-paste blocks, unnecessary branches), proposing targeted refactors.
    • Tools/products/workflows: AST-Grep rule pack derived from the paper’s 137 rules; “mass concentration” detector; quick-fix snippets; reviewers’ checklist for erosion hotspots.
    • Assumptions/dependencies: Review culture open to automated feedback; static analysis tuned to the project; refactor suggestions guarded by tests.
  • Industry — Prompt strategy and workflow templates
    • Sector(s): Software/DevTools, Consulting
    • What to do: Use “anti-slop” and “plan-first” prompts to improve initial code quality when engaging agentic tools; pair with CI slop gates to sustain quality.
    • Tools/products/workflows: Prompt libraries; internal playbooks combining plan-first → implement → refactor passes; cost tracking per checkpoint to catch diminishing returns.
    • Assumptions/dependencies: The paper shows prompts lower intercepts but not degradation slopes; teams need guardrails (quality gates, refactor steps) beyond prompting.
  • Industry — Cost/performance observability for agent workflows
    • Sector(s): Platform Engineering, DevOps, FinOps
    • What to do: Track cost per checkpoint, pass-rate subtype (core/functionality/error/regression), and quality metrics over time to identify “spend-more-for-less” phases.
    • Tools/products/workflows: Cost dashboards; alerts when cost rises without correctness gains; escalation to human-led refactoring.
    • Assumptions/dependencies: Access to usage/cost telemetry; tagging agent-sourced changes; consistent workload definitions.
  • Academia — Research baselines and courseware for iterative software engineering
    • Sector(s): Academia, Education
    • What to do: Use SlopCodeBench to study long-horizon code aging, ablation of architectural choices, and interventions (e.g., modular scaffolds, planning tools). Turn checkpoints into semester projects testing maintainability.
    • Tools/products/workflows: Reproducible Docker setup; assignments that grade both correctness and erosion/verbosity trajectories; human vs. agent comparative labs.
    • Assumptions/dependencies: Compute budget for runs; instructor guidance on interpreting metrics; ethical use policies for AI tools.
  • Open-Source stewardship — PR triage for bot/agent contributions
    • Sector(s): Open-Source, Community Maintainers
    • What to do: Auto-label PRs with erosion/verbosity deltas; request modularization when single functions absorb most complexity mass; reject PRs that fail regression quality gates.
    • Tools/products/workflows: GitHub Actions app; contributor docs explaining quality expectations; “slop debt” badges per directory.
    • Assumptions/dependencies: Community buy-in; balanced thresholds that don’t block legitimate complexity; allowance for domain-specific exceptions.
  • Model/agent selection inside enterprises
    • Sector(s): Enterprise IT, MLOps
    • What to do: Gate internal rollouts on long-horizon metrics (e.g., no adoption unless erosion slope < X and regression pass rate > Y across problem phases).
    • Tools/products/workflows: Internal “agent bake-off” pipeline; periodic re-evaluation after model updates.
    • Assumptions/dependencies: Stable benchmark snapshots to avoid training contamination; periodic recalibration as models change.
  • Safety/compliance guidelines for AI-assisted coding
    • Sector(s): Governance, Risk, Compliance (GRC)
    • What to do: Update engineering policies to require extension-robustness checks (hidden-test philosophy, regression tests preserved) and quality budgets for AI-generated code.
    • Tools/products/workflows: Policy addenda referencing trajectory-level metrics; audit trail capturing quality deltas; approval workflows for exceptions.
    • Assumptions/dependencies: Mapping metrics to compliance frameworks (e.g., ISO/IEC 25010 maintainability), and to industry-specific standards.
  • Individual developers and small teams — “Slop scan” before merging
    • Sector(s): Daily practice, Indie devs, Startups
    • What to do: Run a lightweight CLI to report top erosion hotspots and duplicate blocks; refactor high-mass functions before shipping.
    • Tools/products/workflows: Simple local tool bundling CC, clone detection, and AST-Grep; pre-commit hook; short refactor checklists.
    • Assumptions/dependencies: Willingness to budget a small amount of time for refactoring; tests to maintain behavior during cleanup.
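
A CI "slop gate" of the kind described in the applications above could be sketched as follows. The metric names, budget values, and hard-coded inputs are placeholders; real erosion and verbosity numbers would come from static analyzers and clone detectors wired into the pipeline.

```python
# Hypothetical CI "slop gate": flag a PR when maintainability metrics rise
# past a per-PR budget relative to the base branch. All values illustrative.

BUDGETS = {"erosion": 0.02, "verbosity": 0.05}  # max allowed increase per PR

def gate(base_metrics: dict, pr_metrics: dict, budgets: dict = BUDGETS) -> list:
    # Return a human-readable message for every budget the PR exceeds.
    failures = []
    for name, budget in budgets.items():
        delta = pr_metrics[name] - base_metrics[name]
        if delta > budget:
            failures.append(f"{name} rose by {delta:.3f} (budget {budget})")
    return failures

base = {"erosion": 0.40, "verbosity": 0.22}
pr = {"erosion": 0.47, "verbosity": 0.23}  # erosion jump blows the budget
failures = gate(base, pr)
for msg in failures:
    print("SLOP GATE:", msg)
```

A real deployment would compute both metric dictionaries in CI (failing the build when `gate` returns anything) and calibrate the budgets per repository to avoid false positives on legitimately complex changes.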

Long-Term Applications

These applications need further research, scaling, standards, or productization to be broadly feasible.

  • Industry/Research — Agent architectures with enforced structural discipline
    • Sector(s): Software/DevTools, Agent Platforms
    • Concept: New agent loops that plan modules, enforce boundaries, and redistribute complexity automatically after each turn (preventing “complexity mass” from collapsing into single functions).
    • Potential products: Architect-in-the-loop agents; “trajectory refactoring” planners that run after every change; structure-preserving code generators.
    • Dependencies: Advances in program synthesis, modularity constraints, and multi-turn planning; robust refactoring safety nets; integration with IDEs/CI.
  • Academia/ML — Training and evaluation with trajectory-level rewards
    • Sector(s): ML research, LLM training
    • Concept: Use erosion/verbosity slope as part of the reward function for RL or DPO, or as reranking signals; generate synthetic iterative curricula that force architectural decisions.
    • Potential products: Datasets of multi-checkpoint tasks; open-source reward models for “extension robustness.”
    • Dependencies: Stable, non-gamable quality metrics; scalable generation of iterative tasks across domains and languages.
  • Industry/Standards — Maintainability and “slop” metrics as part of quality standards
    • Sector(s): Standards bodies, Regulated industries (healthcare, finance, automotive)
    • Concept: Establish norms for extension robustness (e.g., maximum allowed increase in erosion per release), and require reporting for AI-generated code.
    • Potential products: Industry guides; certification programs; procurement checklists mandating long-horizon benchmarks.
    • Dependencies: Consensus on metrics, thresholds, and audits; mapping to ISO/IEC 25010 and domain standards (e.g., IEC 62304 for medical device software).
  • Cross-language and domain-specific SlopCodeBench suites
    • Sector(s): Embedded systems, Robotics, Data/ETL, Cloud infra, Energy
    • Concept: Black-box, language-agnostic benchmarks tailored to other ecosystems (C/C++ for embedded/robotics; JVM for enterprise; SQL/ETL for data pipelines) with evolving specs.
    • Potential products: “SlopCodeBench-Embedded,” “-Data,” “-Infra”; vendor-neutral challenge tracks.
    • Dependencies: High-quality parsers/clone detectors per language; realistic domain checkpoints; cost-effective harnesses.
  • Self-healing refactoring services
    • Sector(s): DevTools, Platform Engineering
    • Concept: Always-on services that observe CI results and proactively redistribute complexity, deduplicate code, and extract modules between sprints.
    • Potential products: SaaS refactoring copilots; “budget-aware” maintainers that keep erosion/verbosity within SLOs.
    • Dependencies: Safe automated refactoring at scale; semantic diffing; test generation to validate behavior; change management.
  • Governance/Policy — Regulatory audits of AI-generated code longevity
    • Sector(s): Government, Critical infrastructure, Defense
    • Concept: Include long-horizon maintainability checks in audits (e.g., “simulate three requirement changes; demonstrate quality stability”).
    • Potential products: Audit frameworks; guidance for public-sector AI procurement; risk scoring for code longevity.
    • Dependencies: Auditor capacity; standardized test suites; confidentiality controls for black-box testing; avoiding training contamination.
  • Insurance and risk pricing for AI-assisted software
    • Sector(s): Insurance, FinTech
    • Concept: Underwrite operational risk based on extension robustness metrics and trajectories; discounts for teams with strong slop gates and refactoring practices.
    • Potential products: Code maintainability risk scores; premium adjustments tied to CI telemetry.
    • Dependencies: Empirical linkage between metrics and incident rates; secure data sharing; defenses against metric gaming.
  • IDEs with proactive “slop budgets” and architectural nudging
    • Sector(s): IDEs, Developer experience
    • Concept: Real-time feedback on mass concentration and duplication; “design-first” scaffolding; nudges to create extension points early.
    • Potential products: Next-gen IDEs embedding erosion/verbosity HUDs; pattern-level code actions.
    • Dependencies: Low-latency analysis; unobtrusive UX; accurate cross-project symbol analysis; developer acceptance.
  • Beyond code — Extension-robustness benchmarks for other agentic domains
    • Sector(s): Workflow automation, Data science/ML pipelines, Robotics
    • Concept: Iterative benchmarks with evolving requirements for workflows, DAGs, and control policies; track “process slop” (redundant steps, brittle hubs).
    • Potential products: “SlopFlowBench” for automation pipelines; long-horizon DAG quality metrics.
    • Dependencies: Domain-specific observability; comparable quality measures; execution sandboxes.
  • Enterprise SLOs for code health
    • Sector(s): Enterprise IT
    • Concept: Treat erosion/verbosity slope as SLOs, reviewed at release gates; tie OKRs and incentives to code longevity.
    • Potential products: Health dashboards; executive reports; portfolio risk heatmaps.
    • Dependencies: Organization-wide adoption; calibration by system complexity; guardrails to avoid perverse incentives.
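Several of the applications above (trajectory-level RL rewards, IDE "slop budgets," code-health SLOs) hinge on turning erosion into a scalar signal and tracking its slope across checkpoints. The sketch below is illustrative only: it assumes per-function (cyclomatic complexity, SLOC) pairs are already available (e.g., from Radon), takes CC × SLOC as the complexity-mass combination, and uses a hypothetical high-complexity threshold of 10; the paper's exact formulation may differ.

```python
def complexity_mass(cc: int, sloc: int) -> int:
    """Per-function 'mass' combining cyclomatic complexity and size.
    The product is an assumed combination; the paper only says the two
    are combined."""
    return cc * sloc

def erosion(functions, cc_threshold: int = 10) -> float:
    """Fraction of total complexity mass held by functions whose CC
    exceeds the high-complexity threshold (threshold is illustrative)."""
    total = sum(complexity_mass(cc, sloc) for cc, sloc in functions)
    if total == 0:
        return 0.0
    heavy = sum(complexity_mass(cc, sloc)
                for cc, sloc in functions if cc > cc_threshold)
    return heavy / total

def trajectory_slope(values) -> float:
    """Least-squares slope of a metric over checkpoints 0..n-1;
    a positive slope means quality is degrading turn over turn."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den if den else 0.0

# Example: per-function (CC, SLOC) lists at four checkpoints of one
# trajectory, in which one function absorbs ever more complexity.
checkpoints = [
    [(3, 20), (4, 15)],            # checkpoint 0: no heavy functions
    [(12, 60), (3, 20), (4, 15)],  # a heavy function appears
    [(15, 90), (3, 20)],
    [(22, 140), (2, 10)],
]
series = [erosion(fns) for fns in checkpoints]
print(trajectory_slope(series) > 0)  # True: erosion is rising
```

A reward model or release gate would then penalize a positive slope rather than any single snapshot, which is the trajectory-level framing the benchmark argues for.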

Notes on Assumptions and Dependencies Across Applications

  • Language coverage: The current results focus on Python; extending metrics to other languages requires high-quality ASTs, clone detectors, and threshold calibration.
  • Metric validity and gaming: Erosion/verbosity correlate with maintainability but are partial views; combine with human review and additional metrics (cohesion/coupling, test quality). Watch for gaming (e.g., splitting functions without improving design).
  • Cost and compute: Long-horizon runs are time- and cost-intensive; organizations should sample tasks representative of their workload.
  • Hidden tests and black-box evaluation: Black-box practices reduce architectural leakage but require careful spec design and robust harness isolation (e.g., Dockerized runs).
  • Data contamination: When benchmarking proprietary or open-source history, ensure models haven’t seen targets or commit trails.
  • Culture and change management: Successful adoption hinges on developer buy-in, education, and measured thresholds that respect necessary complexity.
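On the metric-gaming point above: verbosity is defined as clone lines normalized by LOC, which real tooling computes with structural (token- or AST-level) clone detection. As a rough stand-in only, the sketch below counts exact duplicate non-trivial lines; it is a deliberately naive proxy and is easy to game, which is exactly why the notes above call for combining such metrics with human review.

```python
from collections import Counter

def verbosity_proxy(source: str, min_len: int = 10) -> float:
    """Naive verbosity proxy: fraction of non-trivial lines that are
    exact duplicates of another line, normalized by LOC. Real clone
    detectors match normalized token/AST sequences, not raw lines,
    so this understates structural duplication."""
    lines = [ln.strip() for ln in source.splitlines()]
    code = [ln for ln in lines if len(ln) >= min_len]  # skip trivial lines
    if not code:
        return 0.0
    counts = Counter(code)
    clone_lines = sum(c for c in counts.values() if c > 1)
    return clone_lines / len(code)
```

Even this crude version shows why thresholds need calibration: trivially splitting or renaming duplicated lines drives the score to zero without improving the design.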

Glossary

  • Agent harness: The command-line or tooling framework that runs and coordinates an AI coding agent. "We therefore evaluate agents in their native harnesses rather than frameworks such as MiniSWEAgent."
  • Agentic coding benchmarks: Evaluations focused on autonomous coding agents, often measuring single-shot performance against complete specs. "agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications."
  • AST-based pattern matching: Matching code structures using abstract syntax trees rather than raw text. "AST-based pattern matching with metavariable capture."
  • AST-Grep: A tree-based code query and linting tool used to detect syntactic patterns. "constructing 137 targeted AST-Grep rules."
  • Black-box contract: A specification that constrains only observable behavior, not internal implementation details. "each task is a black-box contract that's implementable in any language."
  • Black-box testing: Assessing software solely through its external interfaces without inspecting internals. "Black-box testing."
  • CLI contract: The precise command-line interface (arguments and outputs) a tool must expose. "specifies the CLI contract: <root_dir> --rules <file> [--encoding <name>]"
  • Clone lines: Lines identified as duplicated code fragments used to quantify redundancy. "clone lines normalized by LOC."
  • Complexity mass: A per-function metric combining cyclomatic complexity and size to capture structural burden. "assign each callable a complexity mass"
  • Cyclomatic complexity (CC): A measure of the number of independent paths through code, reflecting decision-point complexity. "where CC(f) is the cyclomatic complexity of callable f"
  • Docker container: An isolated runtime environment packaging software and dependencies for reproducible execution. "Each checkpoint runs in a fresh Docker container under a non-root user."
  • Fail-to-pass tests: An evaluation regime where failing tests guide development until they pass, often revealing specific target behaviors. "The dominant SWE evaluation paradigm provides fail-to-pass tests"
  • Headless mode: Running an application without a graphical interface or interactive UI. "then invoke each in headless mode."
  • High-complexity threshold: The cutoff beyond which functions are considered highly complex for erosion analysis. "functions exceeding a high-complexity threshold:"
  • Isolation (ISO) correctness: Passing only the current checkpoint’s non-regression tests, ignoring past requirements. "we also report correct in isolation (ISO)"
  • Language-agnostic: Designed to work across multiple programming languages without relying on language-specific internals. "a language-agnostic benchmark comprising 20 problems and 93 checkpoints"
  • Metavariable capture: Binding placeholders in AST patterns to concrete subtrees or tokens during matching. "AST-based pattern matching with metavariable capture."
  • Normalization guidance: Instructions that specify acceptable formatting or ordering to avoid false failures. "Specifications add normalization guidance only where arbitrary choices could cause false failures"
  • Oracle-derived signals: Correctness feedback obtained from test oracles rather than from open-ended specifications. "test- or oracle-derived signals"
  • Radon: A Python static analysis tool commonly used to compute cyclomatic complexity and related metrics. "the popular code analysis tool Radon."
  • Reasoning effort: A configuration parameter controlling how much internal computation the model expends on a task. "we set the reasoning effort parameter to high."
  • Regression tests: Previously introduced tests retained to ensure earlier behaviors continue to work. "Regression — All tests from prior checkpoints."
  • Selector rules: Structured selection expressions over AST nodes used to target code patterns and transformations. "Selector rules and auto-fix functionality."
  • Semantic source-file search: Searching code by structural or meaning-based patterns rather than raw text. "builds a CLI tool for semantic source-file search"
  • SLOC: Source Lines of Code; a size metric for code artifacts. "and SLOC(f) is its source lines of code."
  • Structural attractors: Recurring structural patterns that code tends to converge to under iterative edits. "Code converges toward structural attractors"
  • Structural duplication: Repeated code fragments across the codebase indicating redundancy. "we measure structural duplication: clone lines normalized by LOC."
  • Structural erosion: Increasing concentration of complexity in a small set of already complex functions over time. "structural erosion, the share of complexity mass concentrated in high-complexity functions."
  • Thinking-token budget: A limit on tokens allocated to the model’s explicit reasoning process. "we configure the thinking-token budget via the environment variable"
  • Trajectory-level quality signals: Metrics tracked across multiple iterations to assess how quality evolves. "two trajectory-level quality signals: verbosity, the fraction of redundant or duplicated code, and structural erosion, the share of complexity mass concentrated in high-complexity functions."
  • Verbosity: The proportion of redundant or duplicated code and unnecessary scaffolding in a codebase. "verbosity, the fraction of redundant or duplicated code"

Open Problems

We found no open problems mentioned in this paper.
