- The paper demonstrates that optimal agent coordination in security tasks is context-dependent, challenging the assumption that more agents always yield better performance.
- It employs a controlled benchmark of five architectures across live targets, quantifying key metrics such as validated detection rates, token cost, and time-to-find vulnerabilities.
- The study reveals that performance shifts significantly between whitebox and blackbox modes, underlining the need for adaptive routing policies in security auditing.
Agentic Architecture Benchmarking for Offensive Security: An Empirical Evaluation
Motivation and Research Framework
This paper addresses a notable gap in agentic security systems: the empirical assessment of agent coordination topologies in offensive security tasks. Prior implementations of tool-using LLM agents for security auditing (e.g., PentestGPT, MAPTA) typically fixed a single agent topology, thus obscuring the conditions under which additional agents provide tangible benefit or incur unnecessary overhead. The authors present a controlled benchmark to evaluate five canonical agent architectures across 20 live targets (10 web/API, 10 binary), exposed in both whitebox and blackbox modes. The primary aim is to isolate the effect of coordination structure on validated vulnerability detection rates, latency, and operational cost, under matched prompts, tools, and test budgets.
Benchmark Design and Evaluation Protocol
Security auditing is cast as an interactive, multi-step decision process that requires evidence gathering, hypothesis refinement, and validation loops, and therefore calls for agentic architectures rather than static next-token prediction. The benchmark suite comprises containerized targets, each exposing a single endpoint-reachable ground-truth vulnerability, with deterministic resettability and machine-readable metadata to ensure reproducibility.
Five architectures are implemented:
- SAS (Single-Agent System): Sequential, deterministic scan and validation.
- MAS-Indep: Independent parallel agents with late best-of-N selection.
- MAS-Decent: Peer voting with majority-based scan selection.
- MAS-Central: Hub-and-spoke orchestration with specialist validation.
- MAS-Hybrid: Hierarchical routing with two-level candidate selection.
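To make the difference between two of these coordination rules concrete, here is a minimal Python sketch of late best-of-N selection (as in MAS-Indep) and strict-majority peer voting (as in MAS-Decent). The `Candidate` schema, its field names, and the tie-breaking rules are assumptions made for illustration; the summary does not describe the paper's actual data structures or selection criteria.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One agent's proposed finding (hypothetical schema, not from the paper)."""
    agent_id: str
    vuln_class: str     # illustrative label such as "sqli"
    confidence: float   # self-reported score in [0, 1]
    validated: bool     # whether the agent's own validation step succeeded


def best_of_n(candidates):
    """Late selection over independent agents: prefer validated findings,
    then break ties by confidence. Returns None for an empty pool."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c.validated, c.confidence))


def majority_vote(proposals):
    """Peer voting: return the proposal (e.g. the next scan to run) backed by
    a strict majority of peers, or None when no proposal has one."""
    if not proposals:
        return None
    winner, votes = Counter(proposals).most_common(1)[0]
    return winner if votes * 2 > len(proposals) else None


pool = [
    Candidate("a1", "sqli", 0.9, False),
    Candidate("a2", "xss", 0.6, True),
    Candidate("a3", "sqli", 0.7, False),
]
print(best_of_n(pool).agent_id)                                # a2
print(majority_vote(["port-scan", "port-scan", "dir-fuzz"]))   # port-scan
```

In this sketch, best-of-N keeps every agent's work until the end and pays for all of it, while voting discards minority proposals early. That is one plausible reading of why the two topologies sit at different points on the cost and coverage axes.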
The prompt scaffolds and tool interfaces are held fixed, ensuring differences arise solely from coordination topology. Three model families (GPT-5.2, Claude Opus 4, Kimi K2) are evaluated, with all operations sandboxed within Docker containers.
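The experimental grid implied by this design can be enumerated directly. The sketch below assumes a full-factorial layout with one run per cell; the summary does not state the breakdown explicitly, and the target identifiers are placeholders invented for illustration.

```python
from itertools import product

ARCHITECTURES = ["SAS", "MAS-Indep", "MAS-Decent", "MAS-Central", "MAS-Hybrid"]
MODELS = ["GPT-5.2", "Claude Opus 4", "Kimi K2"]
MODES = ["whitebox", "blackbox"]

# Placeholder IDs: the benchmark has 10 web/API and 10 binary targets,
# but these names are hypothetical.
TARGETS = [f"web-{i:02d}" for i in range(1, 11)] + [f"bin-{i:02d}" for i in range(1, 11)]

# One cell per (architecture, model, mode, target) combination.
run_matrix = list(product(ARCHITECTURES, MODELS, MODES, TARGETS))
print(len(run_matrix))  # 600
```

Under that assumption the grid has 5 x 3 x 2 x 20 = 600 cells, which is consistent with the number of core runs reported in the findings.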
Empirical Findings
Architecture-Level and Model-Level Tradeoffs
Results from 600 core runs indicate a non-monotonic cost-quality frontier:
- MAS-Indep achieves the highest validated detection rate (64.2%), but at higher cost and latency ($0.143/finding, 111.9s median TTFV).
- SAS remains competitive on validated detection (50.8%) and leads on efficiency ($0.058/finding, 53.0s median TTFV).
- Centralized and peer-voting topologies do not outperform SAS, highlighting that coordination overhead is not uniformly compensated by coverage gains.
Across models, Kimi K2 achieves the best validated detection rate (52.0%) at the lowest cost ($0.047/finding), while GPT-5.2 is more expensive ($0.258/finding) without a commensurate accuracy improvement.
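The size of these trade-offs is easier to judge in relative terms. The short calculation below uses only the figures quoted above; the variable names are illustrative.

```python
# Architecture figures quoted in the list above.
SAS = {"detection_pct": 50.8, "cost_per_finding": 0.058, "ttfv_s": 53.0}
MAS_INDEP = {"detection_pct": 64.2, "cost_per_finding": 0.143, "ttfv_s": 111.9}

# Model-level cost per finding quoted above.
KIMI_K2_COST = 0.047
GPT_5_2_COST = 0.258

detection_gain_pts = MAS_INDEP["detection_pct"] - SAS["detection_pct"]
cost_ratio = MAS_INDEP["cost_per_finding"] / SAS["cost_per_finding"]
ttfv_ratio = MAS_INDEP["ttfv_s"] / SAS["ttfv_s"]
model_cost_ratio = GPT_5_2_COST / KIMI_K2_COST

print(f"MAS-Indep vs SAS detection gain: {detection_gain_pts:.1f} points")  # 13.4 points
print(f"MAS-Indep vs SAS cost per finding: {cost_ratio:.2f}x")              # 2.47x
print(f"MAS-Indep vs SAS median TTFV: {ttfv_ratio:.2f}x")                   # 2.11x
print(f"GPT-5.2 vs Kimi K2 cost per finding: {model_cost_ratio:.2f}x")      # 5.49x
```

In other words, MAS-Indep buys 13.4 additional points of validated detection for roughly 2.5 times the cost per finding and roughly twice the median TTFV of SAS.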
Observability Regimes and Domain Effects
The most pronounced performance deltas occur across observability and domain axes:
- Whitebox access yields a 34.3-point uplift in validated detection (67.0% vs. 32.7% for blackbox).
- Web targets are markedly easier than binary (74.3% validated vs. 25.3%).
- The hardest cases are binary-blackbox, where naming the correct vulnerability class does not reliably lead to a dynamically validated exploit.
All five architectures shift clearly between regimes, but the topologies on the whitebox frontier remain statistically close to one another. Paired bootstrap intervals indicate that MAS-Indep's gain is significantly larger under blackbox conditions.
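For readers unfamiliar with the technique, a paired bootstrap interval can be sketched as follows. This is a generic percentile bootstrap over per-target outcomes, not the authors' analysis code; the function name, the resample count, and the outcome vectors in the usage example are all assumptions, and the data are synthetic.

```python
import random


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for mean(a) - mean(b).

    a[i] and b[i] must be outcomes of two architectures on the same target,
    so each resample draws target indices and keeps the pairs together.
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic per-target outcomes (1 = validated finding, 0 = none).
# These values are made up for illustration and are not from the paper.
mas_indep_hits = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
sas_hits = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
print(paired_bootstrap_ci(mas_indep_hits, sas_hits))
```

Pairing matters because target difficulty varies widely (web versus binary, for instance). Resampling targets jointly removes that shared variance from the comparison, and an interval that excludes zero suggests the difference between the two architectures is not a resampling artifact.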
Cost and Operational Overhead
Wider coordination broadens search but incurs substantial token and context duplication. Output-token costs dominate, and whitebox exposure increases input-side spend. The SAS baseline retains a speed and cost advantage, making it attractive for efficiency-centric deployments, while MAS-Indep offers maximal coverage for tasks that demand high recall.
Post-hoc Analyses and Adaptive Routing
The empirical matrix reveals no universally dominant architecture. The main practical implication is to reframe architecture selection from a static design choice into a routing problem: the optimal topology depends on task features (surface entropy, exploit-chain depth, tool intensity, volatility) and on operational constraints. Mixed-effects models separate architecture effects from regime-specific effects, which opens the way to learned adaptive routing policies. Such policies could maximize validated recall at a fixed budget by selecting architectures dynamically from early-run signals and task metadata.
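As a rough illustration of what such a routing policy might look like, the sketch below uses a hand-written rule instead of a learned model. The `TaskProfile` fields, the entropy scale, and both thresholds are hypothetical and were not taken from the paper; only the general direction (wider search under higher uncertainty, SAS when budget is tight) follows the findings summarized here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Task metadata available before a run (hypothetical schema)."""
    mode: str               # "whitebox" or "blackbox"
    domain: str             # "web" or "binary"
    surface_entropy: float  # assumed normalized to [0, 1]
    budget_usd: float       # spend allowed for this target


def route(task, entropy_threshold=0.7, indep_min_budget_usd=0.15):
    """Pick a topology from task metadata.

    Both thresholds are illustrative placeholders, not tuned values.
    """
    uncertain = (
        task.mode == "blackbox"
        or task.domain == "binary"
        or task.surface_entropy > entropy_threshold
    )
    # Pay for parallel exploration only when uncertainty is high
    # and the budget can absorb the extra token cost.
    if uncertain and task.budget_usd >= indep_min_budget_usd:
        return "MAS-Indep"
    return "SAS"


print(route(TaskProfile("blackbox", "binary", 0.9, 1.00)))  # MAS-Indep
print(route(TaskProfile("whitebox", "web", 0.2, 1.00)))     # SAS
print(route(TaskProfile("blackbox", "binary", 0.9, 0.05)))  # SAS
```

A learned policy would replace the fixed rule with a model fitted to benchmark outcomes, but the interface (task metadata and constraints in, topology out) would likely stay similar.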
Practical and Theoretical Implications
This controlled benchmarking clarifies that agentic coordination is not inherently superior: its value emerges selectively under increased uncertainty (blackbox, binary) and larger exploratory surfaces. SAS remains a strong baseline for low-cost or latency-sensitive use, while broader topologies like MAS-Indep are justified in high-stakes or recall-driven scenarios. The findings challenge the assumption that more agents always yield better performance and point instead to context-aware, cost-conditioned architecture selection.
The benchmark establishes a replicable, artifact-rich foundation for auditing agentic architecture decisions. It enables future work on:
- Automatic topology selection via learned policies,
- Extension to larger, more heterogeneous real-world security workloads,
- Integration of mixed domain and observability signals in adaptive agent orchestration.
Conclusion
The paper empirically studies agent coordination topologies for offensive security tasks, demonstrating that coordination structure significantly affects validated detection and operational cost. The main result is the characterization of a non-monotonic cost-quality frontier, not a single-winner paradigm. Architecture choice should be viewed as a routing problem contingent on task-specific conditions, with empirical trade-offs exposed via rigorous benchmarking. The methodology and findings provide actionable guidance for designing agentic systems targeting live, partially observable environments under budget constraints. Future work will likely explore learned, adaptive topology selection and broader benchmark coverage for production-scale security auditing.