- The paper presents AgentFlow, a unified typed DSL for multi-agent harness synthesis that integrates agent roles, communication, and tool allocations in a single optimization loop.
- It leverages structured runtime feedback—such as test verdicts and coverage maps—to diagnose failures and guide targeted modifications in harness design.
- Experiments on TerminalBench-2 and Google Chrome report the highest pass rate in the benchmark's leaderboard snapshot and the discovery of ten previously unknown vulnerabilities in Chrome.
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
Motivation and Limitations of Existing Agentic Vulnerability Discovery
LLM agents have demonstrated the ability to autonomously identify real security vulnerabilities, including some missed by human auditors and automated fuzzers. These advances are enabled by agentic systems that, rather than deploying a single LLM agent, orchestrate multiple specialized agents through a harness that defines their roles, prompts, tools, and communication topology. The harness mediates coordination, information sharing, and retries, and it substantially affects success rates.
However, current harness optimization approaches are constrained by two principal limitations: (i) narrow search space—optimizing only a subset of harness parameters, often fixing roles or communication structures, and (ii) coarse feedback—typically relying on binary pass/fail signals with little diagnostic precision. As a result, these systems cannot adaptively redesign multi-agent orchestrations in response to failure modes localized through richer runtime feedback channels.
AgentFlow: Unified Harness Synthesis with Structured Feedback
AgentFlow addresses both limitations by introducing a typed graph domain-specific language (DSL) that jointly parameterizes agent roles, communication topology, message schemas, tool allocations, and coordination protocols in a single search space. Harnesses are represented as programs in this DSL, so an edit to any dimension can be made within a single optimization step. A type system enforces structural well-formedness, so malformed harnesses are rejected before expensive LLM evaluation.
Furthermore, AgentFlow leverages structured runtime feedback, consuming signals emitted by the target program—such as test verdicts, coverage maps, sanitizer reports, and agent traces. This enables fine-grained diagnosis of failure causes, allowing the outer-loop optimizer to localize bottlenecks and propose targeted harness modifications, rather than relying solely on scalar outcome signals.
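To make the contrast between scalar and structured feedback concrete, here is a minimal Python sketch. It is not AgentFlow's implementation: the `RunFeedback` record, its field names, and the toy rules in `diagnose` are all assumptions chosen for illustration. The point is only that the same run yields one number under a coarse signal but several localized findings when verdicts, coverage, and traces are kept separate.

```python
from dataclasses import dataclass, field


@dataclass
class RunFeedback:
    """Hypothetical record of structured signals from one harness run."""
    test_verdicts: dict[str, bool]      # test name -> passed?
    covered_functions: set[str]         # functions reached during the run
    sanitizer_reports: list[str] = field(default_factory=list)
    agent_trace: list[tuple[str, str]] = field(default_factory=list)  # (agent, event)


def scalar_score(fb: RunFeedback) -> float:
    """Coarse signal: fraction of tests that passed."""
    if not fb.test_verdicts:
        return 0.0
    return sum(fb.test_verdicts.values()) / len(fb.test_verdicts)


def diagnose(fb: RunFeedback, target_functions: set[str]) -> list[str]:
    """Toy rules (assumed) that turn structured signals into findings."""
    findings = []
    missed = sorted(target_functions - fb.covered_functions)
    if missed:
        findings.append(f"uncovered targets: {', '.join(missed)}")
    failed = sorted(name for name, ok in fb.test_verdicts.items() if not ok)
    if failed:
        findings.append(f"failing tests: {', '.join(failed)}")
    errored = sorted({agent for agent, event in fb.agent_trace if event == "error"})
    if errored:
        findings.append(f"agents with errors: {', '.join(errored)}")
    return findings


fb = RunFeedback(
    test_verdicts={"t1": True, "t2": False},
    covered_functions={"parse"},
    agent_trace=[("explorer", "error"), ("planner", "ok")],
)
print(scalar_score(fb))
for finding in diagnose(fb, {"parse", "decode"}):
    print(finding)
```

In this toy run the scalar score only says that half the tests failed, while the findings name the uncovered function, the failing test, and the agent that raised an error, which is the kind of information an outer-loop optimizer could use to target an edit.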

Figure 2: High-level overview of the AgentFlow optimization loop, showing iterative proposal, execution, scoring, and diagnosis with structured feedback channels.
The paper formalizes a harness as a tuple (A,G,Σ,Φ,Ψ), covering agent set, communication topology, message schemas, tool allocations, and coordination protocols. The DSL represents harnesses as directed graphs: nodes for agents, edges for dataflow or guarded retry operations, templates for prompts and feedback channel bindings, and set operations for parallel agent ensembles (fan-out). Templates dynamically bind to upstream outputs and runtime structural feedback, and the well-formedness validator ensures that prompts reference only available data sources, edges carry information actually consumed, and all agents are reachable.
This formal abstraction subsumes prior harness optimizers, which vary only some components while holding others fixed. AgentFlow's joint search over all five harness dimensions enables cross-component edits crucial for achieving state-of-the-art performance.
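The three well-formedness conditions described above (prompts reference only available sources, edges carry information that is consumed, all agents are reachable) can be sketched as a small validator. This is an illustrative Python sketch under stated assumptions, not the paper's DSL or type system: the `Agent` and `Harness` classes, the `prompt_refs` field, and the single `entry` agent are hypothetical, and message schemas and coordination protocols are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    # Names of upstream agents or feedback channels the prompt template reads.
    prompt_refs: set[str] = field(default_factory=set)
    tools: set[str] = field(default_factory=set)


@dataclass
class Harness:
    agents: dict[str, Agent]
    edges: set[tuple[str, str]]          # (source, destination) dataflow edges
    entry: str                           # agent where execution starts (assumed)
    feedback_channels: set[str] = field(default_factory=set)


def validate(h: Harness) -> list[str]:
    """Return a list of well-formedness errors; an empty list means valid."""
    errors = []
    for src, dst in sorted(h.edges):
        for end in (src, dst):
            if end not in h.agents:
                errors.append(f"edge {src}->{dst} references unknown agent {end}")
    if h.entry not in h.agents:
        errors.append(f"unknown entry agent {h.entry}")
        return errors

    # 1. Prompts may reference only upstream outputs or bound feedback channels.
    for name, agent in sorted(h.agents.items()):
        upstream = {s for s, d in h.edges if d == name}
        for ref in sorted(agent.prompt_refs - (upstream | h.feedback_channels)):
            errors.append(f"{name}: prompt references unavailable source {ref}")

    # 2. Every edge must carry information its destination actually consumes.
    for src, dst in sorted(h.edges):
        if dst in h.agents and src not in h.agents[dst].prompt_refs:
            errors.append(f"edge {src}->{dst} is never consumed by {dst}")

    # 3. Every agent must be reachable from the entry agent.
    seen, stack = {h.entry}, [h.entry]
    while stack:
        node = stack.pop()
        for s, d in h.edges:
            if s == node and d not in seen:
                seen.add(d)
                stack.append(d)
    for name in sorted(set(h.agents) - seen):
        errors.append(f"{name}: unreachable from entry {h.entry}")
    return errors


good = Harness(
    agents={
        "planner": Agent("planner", prompt_refs={"coverage"}),
        "explorer": Agent("explorer", prompt_refs={"planner"}),
    },
    edges={("planner", "explorer")},
    entry="planner",
    feedback_channels={"coverage"},
)
bad = Harness(
    agents={
        "planner": Agent("planner"),
        "explorer": Agent("explorer", prompt_refs={"triage"}),
        "orphan": Agent("orphan"),
    },
    edges={("planner", "explorer")},
    entry="planner",
)
print(validate(good))
print(validate(bad))
```

Because these checks are purely structural, they run without any LLM call, which matches the stated purpose of rejecting malformed harnesses before expensive evaluation.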
Optimization Procedure and System Architecture
AgentFlow runs an outer-loop optimization algorithm, HarnessOpt, over the space of well-formed DSL programs. Each iteration comprises four stages: (1) proposing a new harness architecture via LLM calls conditioned on the archive and the last diagnosis, (2) executing it on a task set to gather agent traces and structured runtime feedback, (3) scoring the run with domain-specific metrics, and (4) diagnosing failures by attributing them to the responsible agents or coordination links and recommending corrective edits. The archive maintains historical harnesses and their outcomes to avoid redundant proposals and to provide context for diagnosis.
Edits are validated for syntactic and type correctness and subjected to a smoke test before full deployment. Approximately 20% of proposed harnesses are rejected as malformed, preventing unnecessary computational expense.
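The propose, validate, execute, score, and diagnose cycle described above can be sketched as a generic loop. This is a hedged illustration rather than the paper's HarnessOpt: the function `harness_opt`, its callable parameters, and the toy "harness as a set of role names" instantiation are assumptions, and in the real system the proposer and diagnoser would be LLM calls rather than the deterministic stand-ins used here.

```python
from typing import Callable


def harness_opt(seed, propose: Callable, is_well_formed: Callable,
                execute: Callable, score: Callable, diagnose: Callable,
                iterations: int):
    """Outer loop: propose -> validate -> execute -> score -> diagnose."""
    feedback = execute(seed)
    archive = [(seed, score(feedback))]       # history of (harness, score)
    diagnosis = diagnose(seed, feedback)
    rejected = 0
    for _ in range(iterations):
        candidate = propose(archive, diagnosis)
        if not is_well_formed(candidate):     # cheap gate before execution
            rejected += 1
            continue
        feedback = execute(candidate)
        archive.append((candidate, score(feedback)))
        diagnosis = diagnose(candidate, feedback)
    best = max(archive, key=lambda entry: entry[1])
    return best, archive, rejected


# Toy instantiation (assumed): a harness is a frozenset of role names, and
# the "task" succeeds fully only when all required roles are present.
REQUIRED = ["planner", "explorer", "triage"]


def toy_execute(harness):
    return [role for role in REQUIRED if role not in harness]  # missing roles


def toy_score(missing):
    return 1 - len(missing) / len(REQUIRED)


def toy_diagnose(harness, missing):
    return missing[0] if missing else None    # name one responsible gap


def toy_propose(archive, diagnosis):
    best_harness, _ = max(archive, key=lambda entry: entry[1])
    return best_harness | {diagnosis} if diagnosis else best_harness


def toy_well_formed(harness):
    return harness <= set(REQUIRED)           # only known roles allowed


best, archive, rejected = harness_opt(
    frozenset({"planner"}), toy_propose, toy_well_formed,
    toy_execute, toy_score, toy_diagnose, iterations=3,
)
print(sorted(best[0]), best[1], rejected)
```

The sketch keeps the well-formedness gate ahead of execution, so a rejected candidate costs only the proposal step. It does not deduplicate proposals against the archive, which the described system does.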
Numerical Results: TerminalBench-2 and Chrome Vulnerabilities
AgentFlow was evaluated on TerminalBench-2 using Claude Opus 4.6, achieving a pass rate of 84.3%, the highest in the leaderboard snapshot and 2.9 percentage points above the strongest hand-engineered baseline. The synthesis trajectory shows three optimization phases (infrastructure, specialization, and parallel ensemble construction), each targeting a different harness layer. Ablation studies indicate that prompt edits, structural edits, and tool edits are complementary, which is why cross-component edits are essential.
On Google Chrome (with Kimi K2.5), AgentFlow discovered ten previously unknown vulnerabilities, including two critical sandbox escapes (CVE-2026-5280 and CVE-2026-6297), all confirmed by the vendor. The Chrome campaign used subsystem-specific analysts, attack-surface planners, 192 parallel explorers, and multi-stage triage and validation pipelines, illustrating scalability and generalization beyond synthetic benchmarks.
Implications, Limitations, and Future Directions
AgentFlow demonstrates that harness engineering, not just LLM capability, is the dominant factor in agentic vulnerability discovery. The joint DSL enables expressive search and adaptation across all coordination dimensions, and structured feedback channels facilitate precise diagnosis and targeted architectural edits. The practical implications span automated vulnerability discovery, exploit synthesis, and the automation of agentic workflows in complex software testing scenarios.
A notable limitation is the restriction to static topologies and program-level edits: the harness does not change during an execution. Future work may address dynamic harnesses, finer-grained parameterization, and the integration of richer feedback modalities, such as symbolic execution traces and semantic crash clustering. The approach indicates that harness synthesis, rather than solely increasing LLM scale or refining prompts, is pivotal for advancing agentic capabilities in AI-driven software security, and it offers a generalizable framework for optimizing multi-agent orchestration in broader AI domains.
Conclusion
AgentFlow advances automated vulnerability discovery through the synthesis of multi-agent harnesses within a typed graph DSL and structured feedback-driven optimization. By searching over all harness dimensions and consuming runtime diagnostic signals, AgentFlow sets a new performance bar for agentic systems in both synthetic and production codebases, highlighting orchestration as a primary locus of agentic AI progress (2604.20801).