SlopCodeBench: Evaluating Iterative Code Quality

Updated 28 March 2026

SlopCodeBench is a language-agnostic benchmark designed to assess code degradation through trajectory-level metrics such as structural erosion and verbosity.
It simulates iterative software development where agents extend earlier solutions amid evolving requirements, revealing challenges in maintainability.
Empirical results show that agent-generated code increasingly suffers from bloat and complexity, underscoring a deficiency in architectural discipline compared to human code.

SlopCodeBench (SCBench) is a language-agnostic benchmark designed to evaluate how code generated by agentic coding systems degrades during long-horizon, iterative software development tasks. Unlike single-shot evaluations or benchmarks that tightly constrain agent design, SCBench assesses the robustness and maintainability of code as agents repeatedly extend their own prior solutions in response to evolving requirements, a setting intended to emulate the realities of sustained software engineering. SCBench introduces trajectory-level quality metrics capturing structural erosion and verbosity, and benchmarks 11 frontier agent configurations on 20 diverse problems comprising 93 checkpoints, in direct comparison to a panel of human-developed open-source repositories. The findings indicate that current agentic approaches lack the architectural discipline required for maintainable, extensible code in iterative settings (Orlanski et al., 25 Mar 2026).

1. Design Motivation and Benchmark Structure

SlopCodeBench was developed in response to the limitations of conventional LLM-coding benchmarks, which almost exclusively measure a model’s ability to generate code that passes a static test suite given a complete specification. In practice, software is rarely developed this way; initial code is repeatedly extended and modified, and early architectural choices can have outsized influence on subsequent development. Existing iterative benchmarks generally either decompose tasks with gold code at each substep or over-constrain design choices, failing to capture how code quality shapes future extensibility.

SCBench is distinguished by four core design features:

Iterative Self-Extension: At each checkpoint $i$, the agent receives only the updated specification $\mathit{Spec}_i$ and its own codebase from checkpoint $i-1$; neither a conversation log nor gold code is supplied. Only the workspace persists across iterations.
Evolving Specifications: Each of 20 problems consists of a sequence of 3–8 checkpoints (93 in total), each introducing new features, requirements, or behaviors. Regression tests are maintained, ensuring backwards compatibility.
Black-Box, Language-Agnostic Conditions: Specifications restrict only CLI arguments or API interfaces, with hidden test suites to prevent agents from inferring inner architectural constraints. While SCBench is language-agnostic, results reported use only Python.
Unprescribed Internal Structure: Unlike benchmarks such as HumanEval or MBPP, SCBench does not dictate function signatures or module boundaries, obligating agents to create and extend their own architectural abstractions.

A representative example is the code_search problem, which proceeds through checkpoints such as adding new language support, AST-pattern matching, and other forcing functions that penalize hard-coded early decisions.
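The iterative self-extension protocol described above can be sketched as a simple loop. This is a minimal illustration and not the benchmark's actual harness: `Workspace`, `run_trajectory`, and `toy_agent` are hypothetical names, and the workspace is modeled as an in-memory mapping from file paths to contents rather than a real directory.

```python
from typing import Callable, Dict, List

# Hypothetical model of the persistent workspace: file path -> file contents.
Workspace = Dict[str, str]
# Hypothetical agent interface: (specification text, prior workspace) -> new workspace.
Agent = Callable[[str, Workspace], Workspace]


def run_trajectory(specs: List[str], agent: Agent) -> List[Workspace]:
    """Run one problem: the agent sees only Spec_i and its own prior code."""
    workspace: Workspace = {}          # checkpoint 1 starts from an empty workspace
    snapshots: List[Workspace] = []
    for spec in specs:
        # No conversation log and no gold code are passed in; only the
        # workspace produced at the previous checkpoint carries over.
        workspace = agent(spec, dict(workspace))
        snapshots.append(dict(workspace))  # keep a copy for per-checkpoint analysis
    return snapshots


def toy_agent(spec: str, workspace: Workspace) -> Workspace:
    """Stand-in agent that appends one comment line per specification."""
    workspace["main.py"] = workspace.get("main.py", "") + f"# handles: {spec}\n"
    return workspace


snapshots = run_trajectory(["search", "add language", "ast patterns"], toy_agent)
```

Each entry in `snapshots` corresponds to one checkpoint, which is what makes trajectory-level metrics possible: quality can be measured on every snapshot and tracked over the sequence.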

2. Quality Metrics and Formal Definitions

SCBench evaluates agent performance beyond correctness by introducing two trajectory-level metrics:

Structural Erosion $E$ :

Each callable $f$ in the codebase is assigned a "mass"

$\mathrm{mass}(f) = \mathrm{CC}(f) \times \sqrt{\mathrm{SLOC}(f)}$

with $\mathrm{CC}(f)$ denoting cyclomatic complexity and $\mathrm{SLOC}(f)$ the number of source lines of code. Erosion is then the fraction of total mass contained in “high-complexity” callables ($\mathrm{CC} > 10$):

$E = \frac{\sum_{f:\,\mathrm{CC}(f)>10}\mathrm{mass}(f)}{\sum_{f}\mathrm{mass}(f)}$

Structural erosion indicates architectural decay, such as bloated dispatcher functions accruing multiple branching responsibilities as requirements accumulate.
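As a worked illustration of the erosion formula, the sketch below computes $E$ from precomputed per-callable measurements. It assumes that cyclomatic complexity and SLOC have already been obtained from some static-analysis tool; the function names and the convention of returning 0.0 for an empty codebase are assumptions of this example, not part of the benchmark definition.

```python
import math
from typing import Iterable, Tuple


def mass(cc: int, sloc: int) -> float:
    """mass(f) = CC(f) * sqrt(SLOC(f)), as defined in the text."""
    return cc * math.sqrt(sloc)


def structural_erosion(callables: Iterable[Tuple[int, int]],
                       cc_threshold: int = 10) -> float:
    """Fraction of total mass held by callables with CC above the threshold.

    `callables` yields (cyclomatic_complexity, sloc) pairs, one per callable.
    """
    total = 0.0
    high = 0.0
    for cc, sloc in callables:
        m = mass(cc, sloc)
        total += m
        if cc > cc_threshold:   # strict inequality, matching CC > 10
            high += m
    # Assumption of this sketch: an empty codebase has zero erosion.
    return high / total if total else 0.0


# Two small helpers and one branching-heavy dispatcher.
# Masses: 2*sqrt(4)=4, 3*sqrt(9)=9, 12*sqrt(16)=48; E = 48 / 61.
example = [(2, 4), (3, 9), (12, 16)]
erosion = structural_erosion(example)
```

In this example a single high-complexity callable holds 48 of the 61 units of mass, so $E \approx 0.79$ even though two of the three callables are simple.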

Verbosity $V$ :

Verbosity is the number of distinct lines flagged either by 137 curated AST-grep rules (which target redundant or wasteful patterns) or by clone detection, normalized by total LOC:

$V = \frac{\bigl|\{\text{AST-Grep Flagged Lines}\} \cup \{\text{Clone Lines}\}\bigr|}{\text{LOC}}$

This metric isolates code bloat irrespective of function complexity.
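The set union in the verbosity formula matters because a line flagged by both a rule and the clone detector should be counted once. The sketch below illustrates this under the assumption that each line is identified by a `(path, line_number)` pair; the flagged and clone sets are taken as given inputs, and the function name is hypothetical.

```python
from typing import Set, Tuple

# Hypothetical line identifier: (file path, 1-based line number).
LineId = Tuple[str, int]


def verbosity(flagged_lines: Set[LineId],
              clone_lines: Set[LineId],
              total_loc: int) -> float:
    """V = |flagged ∪ clones| / LOC, counting doubly flagged lines once."""
    if total_loc <= 0:
        raise ValueError("total_loc must be positive")
    return len(flagged_lines | clone_lines) / total_loc


flagged = {("a.py", 1), ("a.py", 2)}
clones = {("a.py", 2), ("b.py", 5)}   # ("a.py", 2) overlaps with the rule hits
v = verbosity(flagged, clones, total_loc=20)   # 3 distinct lines out of 20
```

Summing the two counts instead of taking the union would give 4/20 here and overstate bloat whenever the two detectors overlap.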

Both metrics capture degradation in maintainability and extensibility that pass rates alone do not reflect.

3. Experimental Protocol and Agent Population

Checkpoints are executed in isolated, non-root Docker containers. Shell history and installed packages are wiped between runs, and only the evolving working directory is preserved. Each agent is given two hours of wall-clock time per checkpoint with no hard cap on API calls or tokens, and receives only a minimal “just-solve” prompt that asks for a correct solution and for requirements.txt to be kept up to date.

The benchmark includes 11 frontier agent configurations:

Model Name	Implementation Details
Sonnet 4.5	Claude Code 2.0.65
Sonnet 4.6	Claude Code 2.1.44
Opus 4.5	Claude Code 2.0.51
Opus 4.6	Claude Code 2.1.32
GPT 5.1 Codex	Codex CLI 0.65.0
GPT 5.2	Codex CLI 0.71.0
GPT 5.2 Codex	Codex CLI 0.80.0
GPT 5.3 Spark	0.100.0
GPT 5.3 Codex	0.98.0
GPT 5.4	0.110.0
GLM 4.7	Claude Code 2.0.76

Human baselines draw from 48 open-source Python repositories (grouped by GitHub stars) and up to 30 longitudinal commits (568 temporal snapshots) for 20 of these repositories.

4. Empirical Findings and Comparative Results

Key findings include:

Solve Rates: No model completes any problem end-to-end. The highest strict per-checkpoint solve rate is 17.2% (Opus 4.6). Isolated rates (ignoring regression tests) reach ~23.7%, and core-test rates (spec-demonstrated tests only) reach ~53.8%.
Progressive Degradation: Across all agents and problems, structural erosion rises in 80% of trajectories, and verbosity in 89.8%. The mean number of high-CC functions increases from 4.1 to 37.0, and the maximum CC escalates from 27.1 to 68.2 throughout the problem sequence.
Escalating Resource Use, Declining Correctness: Average cost per checkpoint grows by ~2.9× from start to finish, but strict pass-rate dwindles to <0.5%. Core test rates remain comparatively stable (30–40%).
Agent vs. Human Code: Across 990 agent checkpoints, mean verbosity is $V=0.33$ and erosion $E=0.68$, compared to $V=0.15$ and $E=0.31$ for 48 human repositories. Over time, agent code shows a median verbosity growth of 43% (compared to 25% for humans) and erosion growth in 79% of cases (vs. 55% for humans). Human code quality remains comparatively flat by both measures.
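The three solve-rate tiers reported above differ only in which tests must pass. The sketch below shows one way they could be computed from per-checkpoint outcomes. The exact pass criteria are assumptions inferred from the parenthetical descriptions (strict includes regression tests, isolated ignores them, core covers only spec-demonstrated tests), and `CheckpointOutcome` and `solve_rates` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CheckpointOutcome:
    core_passed: bool        # tests demonstrated in the specification
    new_tests_passed: bool   # remaining hidden tests for this checkpoint
    regression_passed: bool  # tests carried over from earlier checkpoints


def solve_rates(outcomes: List[CheckpointOutcome]) -> Dict[str, float]:
    """Fraction of checkpoints solved under each (assumed) criterion."""
    if not outcomes:
        raise ValueError("need at least one checkpoint outcome")
    n = len(outcomes)
    strict = sum(o.core_passed and o.new_tests_passed and o.regression_passed
                 for o in outcomes)
    isolated = sum(o.core_passed and o.new_tests_passed for o in outcomes)
    core = sum(o.core_passed for o in outcomes)
    return {"strict": strict / n, "isolated": isolated / n, "core": core / n}


# Illustrative outcomes only, not benchmark data.
demo = [
    CheckpointOutcome(True, True, True),     # solved under every criterion
    CheckpointOutcome(True, True, False),    # broke an earlier feature
    CheckpointOutcome(True, False, False),   # only the spec examples pass
    CheckpointOutcome(False, False, False),  # fails outright
]
rates = solve_rates(demo)
```

Under these assumed definitions the rates are nested by construction (strict ≤ isolated ≤ core), which is consistent with the ordering of the reported figures.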

5. Prompt-Intervention Analysis

Prompt modifications were assessed for their potential to arrest trajectory-level code degradation:

anti_slop: This prompt penalizes “god functions,” trivial wrappers, excessive nesting, single-use variables, and defensive scaffolding.
plan_first: This prompt enforces an explicit cycle of planning, prototyping, testing, and refactoring, alongside anti-slop policies.

Empirically, such interventions reduce the initial verbosity by ≈33% and erosion by ≈20% (anti_slop), but fail to alter the rate of subsequent quality decay relative to the baseline “just_solve.” There is no statistically significant improvement in any solve-rate metric (paired Wilcoxon $p > 0.05$), and anti_slop in particular increases computational cost (up to 48% more on GPT 5.4) and can even reduce correctness on harder checkpoints.

6. Implications for Benchmarking and Future Research

SlopCodeBench demonstrates that existing pass-rate benchmarks systematically underestimate the fragility and technical debt accumulated in agent-produced code subject to iterative extensions. Such code may pass current tests yet become brittle or inextensible when requirements evolve, accruing “slop” (bloat and erosion) that traditional metrics completely overlook.

The underlying failure mode is not addressable via increased reasoning token budgets or parameter tuning; rather, it reflects a deficit of design discipline in agentic workflows. Effective remedies may require:

Training-time objectives or reinforcement learning shaped by code-quality signals such as structural erosion and verbosity.
Tooling for architectural refactoring, modularity enforcement, and continuous static analysis, with feedback loops to reject patches increasing bloat or complexity without paired refactoring.
Agent architectures supporting hierarchical planning, abstraction boundary setting, and dynamic reorganization of workspaces to separate persistent “boilerplate” from incremental “feature” code.
Automated in-loop quality gates, with flexible thresholds tied to project lifecycle and change scope.
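One of the remedies listed above, an in-loop quality gate, could look roughly like the sketch below. It is a hypothetical design rather than anything the benchmark provides: the `QualitySnapshot` type, the `gate` function, and the default thresholds are all illustrative assumptions, and the erosion and verbosity values are assumed to be computed elsewhere.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QualitySnapshot:
    erosion: float    # structural erosion E of the codebase, in [0, 1]
    verbosity: float  # verbosity V of the codebase, in [0, 1]


def gate(before: QualitySnapshot,
         after: QualitySnapshot,
         max_erosion_increase: float = 0.02,
         max_verbosity_increase: float = 0.02) -> List[str]:
    """Return the reasons to reject a patch; an empty list means accept.

    The thresholds are illustrative defaults and could be loosened or
    tightened depending on project lifecycle and change scope.
    """
    reasons: List[str] = []
    if after.erosion - before.erosion > max_erosion_increase:
        reasons.append("structural erosion increased beyond threshold")
    if after.verbosity - before.verbosity > max_verbosity_increase:
        reasons.append("verbosity increased beyond threshold")
    return reasons


baseline = QualitySnapshot(erosion=0.40, verbosity=0.20)
small_change = gate(baseline, QualitySnapshot(erosion=0.41, verbosity=0.20))
bloated_change = gate(baseline, QualitySnapshot(erosion=0.50, verbosity=0.30))
```

Returning the list of reasons rather than a bare boolean lets the feedback be passed back to the agent, so a rejected patch can be paired with a targeted refactoring request.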

A plausible implication is that continued benchmark innovation—such as trajectory-level, open-ended, agent-centric evaluations typified by SCBench—is essential to shift focus from ephemeral test suite compliance toward durable, extensible quality in agentic software engineering (Orlanski et al., 25 Mar 2026).


References (1)

SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks (2026)
