- The paper presents a behavior-driven fuzzing framework that abstracts developer-reported failures into reusable test cases for AI coding agents.
- It outlines a five-stage pipeline—from anomaly mining to automated trace analysis—ensuring rigorous, repository-grounded evaluation.
- Experimental results across multiple coding agents reveal diverse process-level anomalies, underscoring the need for enhanced robustness testing.
Behavior-Driven Fuzzing for AI Coding Agents: An In-Depth Analysis of ABTest
Motivation and Context
As LLM-based coding agents reach production deployment for tasks like code synthesis, refactoring, debugging, and repository management, the reliability of such agents in realistic, multistep, user-driven workflows is of critical concern. Empirical evidence suggests that deployment-time behavioral anomalies frequently escape detection by conventional evaluation pipelines, which typically rely on static, single-shot benchmarks focusing solely on functional correctness. There remains a significant gap between synthetic task-based evaluation and the kinds of nuanced, process-level errors that emerge in real-world agent workflows—including non-graceful error handling, hallucinated outputs, inappropriate file manipulations, and failures to honor stated constraints.
To address this shortcoming, "ABTest: Behavior-Driven Testing for AI Coding Agents" (2604.03362) introduces a novel methodology for grounding agent evaluation directly in historical user-reported failure data. The ABTest framework systematically abstracts these human-encountered agent failures into reusable behavioral patterns, generating a large set of repository-grounded fuzzing test cases. This enables the controlled, scalable, high-fidelity evaluation of agent robustness across a spectrum of realistic failure modes.
Framework Overview
The ABTest pipeline is structured into five main stages:
- Mining and Abstraction: Extraction of high-quality, developer-confirmed anomaly reports from public code agent repositories, followed by abstraction into a compact set of recurring Interaction Patterns (IPs) and Action Types (ATs).
- Seed Template Synthesis: Systematic composition of compatible IP-AT pairs into seed templates, capturing workflows amenable to test case instantiation.
- Repository-Grounded Instantiation: Generation of concrete, multi-step fuzzing cases by binding templates to paths, artifacts, and constraints in a real open-source repository.
- Agent-Driven Execution and Trace Capture: Stepwise issuance of task instructions to the agent, with fine-grained recording of prompts, traces, artifacts, and workspace mutations.
- Automated and Manual Anomaly Detection: Layered validation using oracles for process-level contract satisfaction, followed by manual analysis to adjudicate true positive anomalies.

Figure 1: An architectural overview of ABTest, detailing the end-to-end process from anomaly mining through test execution and anomaly validation.
Abstraction of Real-World Failures
Central to ABTest is its mechanism for operationalizing historical failures. The authors manually curate and decompose over 400 developer-validated, user-reported issues from three production-grade coding agents: Claude Code, OpenAI Codex CLI, and Gemini CLI. Each report is abstracted to expose:
- Interaction Patterns: High-level, user-intent driven workflows (e.g., file manipulation with rollback and verification, multi-artifact validation, bounded resource execution).
- Action Types: Specific, tool-level operations representing the locus of agent error (e.g., move, patch, rollback, environment setup).
For example, an issue in Gemini CLI where an agent “loses” user files during in-folder organization (Figure 2) is abstracted into a pattern involving multi-step file operations, intermediate validation, and error recovery.

Figure 2: Simplified report and transcript of a Gemini CLI incident involving file loss during directory organization, illustrating a complex, multi-stage anomaly.
Notably, extractive decisions are guided by downstream reusability and observability in fuzzing, rather than mere surface similarity.

Figure 3: Transcript excerpt showing agent claims and eventual admission of partial failure, indicating a critical contract violation.
Seed Generation and Instantiation
The fuzzing workload is systematically constructed by generating all compatible combinations from the 47 deduced Interaction Patterns and 128 Action Types, filtered for workflow-action compositionality. Each seed template is represented structurally with workflow skeletons, anomaly surfaces, and required environmental constraints.

Figure 4: An example seed template built from a compatible Interaction Pattern – Action Type pair, preserving test abstract structure and anomaly target.
Templates are then instantiated as concrete task cases on a real open-source repository (Pallets Click), binding template variables to actual file paths, commands, and observable outcomes.

Figure 5: Instantiated test case showing instruction sequences and artifact expectations mapped onto an executable repository environment.
Automated Execution and Trace Analysis
For each coding agent, every test case is executed in a fresh, isolated workspace, preserving agent context throughout the test. Execution traces are exhaustively logged, covering per-step prompts, tool invocations, command outcomes, file diffs, and terminal artifacts.

Figure 6: Trace capture from a rollback verification step, showing prompt, event trace, command output, and artifact mutation evidence.
Automated oracles filter for contract violations, unexpected mutations, incomplete rollbacks, or misleading agent claims. Candidate anomalies are then surfaced for manual adjudication.
Evaluation Methodology and Numerical Results
The authors evaluate ABTest on three prominent coding agent toolchains—Claude Code (Claude 4.5/3.5 Haiku), OpenAI Codex CLI (GPT-5.1-Codex-Mini, GPT-4o-mini), and Gemini CLI (Gemini 2.5 Flash-Lite)—across 647 distinct repository-backed test cases per configuration, yielding a total of 3,235 agent-task pairs.
Key metrics:
- Total anomalies flagged: 1,573 (across 3,235 test runs)
- True anomalies (after manual curation): 642
- Precision: 40.8%
Breakdown per configuration (manually confirmed anomalies / flagged anomalies, precision):
| Agent/Model |
True Anomalies |
Flagged Anomalies |
Precision |
| Codex CLI (GPT-5.1) |
166 |
277 |
59.9% |
| Codex CLI (GPT-4o) |
95 |
334 |
28.4% |
| Claude Code (4.5 Haiku) |
119 |
259 |
45.9% |
| Claude Code (3.5 Haiku) |
87 |
376 |
23.1% |
| Gemini CLI (2.5 Flash) |
175 |
327 |
53.5% |
Notable findings:
- A large fraction of confirmed anomalies are minor but actionable (unexpected side effects, inconsistent responses).
- Gemini CLI shows the highest fraction of “critical anomalies” involving disruptive repo mutations or non-graceful failure propagation.
- Overlap analysis reveals that most detected anomalies are LLM-specific, with few shared between agent model variants—underscoring that robustness must be evaluated across model backends.
Process-Level Behavioral Insights
ABTest highlights several failure modes invisible to functionally-oriented benchmarks:
- Artifact inconsistency: Agents fabricating outputs or mismatching success claims with on-disk evidence.
- Incomplete or spurious rollback: Rollbacks that either fail silently or propagate partial/inconsistent state.
- Workflow drift: Agents persisting unrelated artifacts, expanding tasks beyond requested boundaries, or failing to halt upon error detection.
- Agent-specific pathologies: Certain LLM backends exhibit “over-acting” (e.g., GPT-5.1), preferring spurious completion, while others “under-complete” or substitute helper workflows when tasks falter.
These patterns provide actionable targets for agent stack hardening, input validation, and improved user feedback loops.
Implications and Future Directions
Practical implications: The ABTest methodology provides a scalable protocol for regression testing, cross-agent and cross-model comparison, and the detection of high-value robustness bugs before deployment. The abstraction layer enables transfer across repositories, agent architectures, and workflow domains; by leveraging historical user failures, the framework stays aligned with real-world risk profiles.
Theoretical implications: The evidence that behavioral robustness properties are strongly model-dependent, even under fixed agent scaffolds, suggests the need for evaluation regimes that explicitly separate agentic orchestration from core model reasoning characteristics. It also implies that traditional accuracy- or pass-rate metrics will underestimate the operational risk surface for deployed agents.
Future work should explore:
- Extension to multiple and more diverse software repositories.
- Automated anomaly explanation and root cause categorization to complement detection.
- Closed-loop feedback between fuzzing outputs and model/agent retraining pipelines.
- Integration with formal verification or property testing to strengthen process-level contract checks.
Conclusion
ABTest introduces a rigorous, behavior-driven fuzzing methodology for systematically uncovering process-level anomalies in LLM-based coding agents, grounded directly in developer-user failure reports (2604.03362). Through abstraction, controlled test synthesis, repository integration, and multi-layered validation, ABTest exposes a rich taxonomy of actionable agent failures, demonstrates strong precision in anomaly discovery, and provides novel insights into the model-conditioned robustness landscape of contemporary coding agents.