Towards Optimal Agentic Architectures for Offensive Security Tasks

Published 20 Apr 2026 in cs.CR and cs.AI | (2604.18718v1)

Abstract: Agentic security systems increasingly audit live targets with tool-using LLMs, but prior systems fix a single coordination topology, leaving unclear when additional agents help and when they only add cost. We treat topology choice as an empirical systems question. We introduce a controlled benchmark of 20 interactive targets (10 web/API and 10 binary), each exposing one endpoint-reachable ground-truth vulnerability, evaluated in whitebox and blackbox modes. The core study executes 600 runs over five architecture families, three model families, and both access modes, with a separate 60-run long-context pilot reported only in the appendix. On the completed core benchmark, detection-any reaches 58.0% and validated detection reaches 49.8%. MAS-Indep attains the highest validated detection rate (64.2%), while SAS is the strongest efficiency baseline at $0.058 per validated finding. Whitebox materially outperforms blackbox (67.0% vs. 32.7% validated detection), and web materially outperforms binary (74.3% vs. 25.3%). Bootstrap confidence intervals and paired target-level deltas show that the dominant effects are observability and domain, while some leading whitebox topologies remain statistically close. The main result is a non-monotonic cost-quality frontier: broader coordination can improve coverage, but it does not dominate once latency, token cost, and exploit-validation difficulty are taken into account.


Authors (2)

Summary

The paper demonstrates that optimal agent coordination in security tasks is context-dependent, challenging the assumption that more agents always yield better performance.
It employs a controlled benchmark of five architectures across live targets, quantifying key metrics such as validated detection rates, token cost, and time-to-find vulnerabilities.
The study reveals that performance shifts significantly between whitebox and blackbox modes, underlining the need for adaptive routing policies in security auditing.

Agentic Architecture Benchmarking for Offensive Security: An Empirical Evaluation

Motivation and Research Framework

This paper addresses a notable gap in agentic security systems: the empirical assessment of agent coordination topologies in offensive security tasks. Prior implementations of tool-using LLM agents for security auditing (e.g., PentestGPT, MAPTA) typically fixed a single agent topology, thus obscuring the conditions under which additional agents provide tangible benefit or incur unnecessary overhead. The authors present a controlled benchmark to evaluate five canonical agent architectures across 20 live targets (10 web/API, 10 binary), exposed in both whitebox and blackbox modes. The primary aim is to isolate the effect of coordination structure on validated vulnerability detection rates, latency, and operational cost, under matched prompts, tools, and test budgets.

Benchmark Design and Evaluation Protocol

Security auditing is cast as an interactive, multi-step decision process that requires evidence gathering, hypothesis refinement, and validation loops, and therefore demands agentic architectures rather than static next-token prediction. The benchmark suite comprises containerized targets, each exposing a single endpoint-reachable ground-truth vulnerability, with deterministic reset and machine-readable metadata to ensure reproducibility.

Five architectures are implemented:

SAS (Single-Agent System): Sequential, deterministic scan and validation.
MAS-Indep: Independent parallel agents with late best-of-N selection.
MAS-Decent: Peer voting with majority-based scan selection.
MAS-Central: Hub-and-spoke orchestration with specialist validation.
MAS-Hybrid: Hierarchical routing with two-level candidate selection.

The prompt scaffolds and tool interfaces are held fixed, so that performance differences can be attributed to coordination topology. Three model families (GPT-5.2, Claude Opus 4, Kimi K2) are evaluated, with all operations sandboxed within Docker containers.
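The difference between late best-of-N selection (MAS-Indep) and majority-based peer voting (MAS-Decent) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Candidate` record, the self-reported `confidence` score, and the strict-majority rule are hypothetical and are not taken from the paper's implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical finding proposed by one agent (not the paper's schema)."""
    agent_id: str
    vuln_class: str    # e.g. "sqli", "xss"
    confidence: float  # assumed self-reported score in [0, 1]


def best_of_n(candidates: List[Candidate]) -> Optional[Candidate]:
    """Late selection: agents run independently, then one candidate is kept."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.confidence)


def majority_vote(candidates: List[Candidate]) -> Optional[str]:
    """Peer voting: return the class backed by a strict majority, else None."""
    if not candidates:
        return None
    counts = Counter(c.vuln_class for c in candidates)
    vuln_class, votes = counts.most_common(1)[0]
    return vuln_class if votes * 2 > len(candidates) else None
```

With three agents proposing "sqli", "xss", and "sqli", voting settles on "sqli", whereas best-of-N keeps whichever single candidate scored highest, even if it is the minority proposal. This is one way the two topologies can diverge on the same evidence.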

Empirical Findings

Architecture-Level and Model-Level Tradeoffs

Results from 600 core runs indicate a non-monotonic cost-quality frontier:

MAS-Indep achieves the highest validated detection rate (64.2%), but at higher cost and latency ($0.143/finding, 111.9 s median TTFV).
SAS remains competitive in validated detection (50.8%) and is the most efficient architecture ($0.058/finding, 53.0 s TTFV).
Centralized and peer-voting topologies do not outperform SAS, highlighting that coordination overhead is not uniformly compensated by coverage gains.

Across model families, Kimi K2 achieves the best validated detection (52.0%) at the lowest cost ($0.047/finding), while GPT-5.2 is more expensive ($0.258/finding) without a commensurate improvement in validated detection.
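The metrics quoted above (validated detection rate, dollars per validated finding, median TTFV) can be derived from per-run records roughly as follows. This is a sketch under stated assumptions: the `Run` fields are hypothetical, and defining cost per finding as total spend divided by the number of validated findings is an assumption rather than the paper's confirmed definition.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Run:
    """Hypothetical per-run record (field names are illustrative)."""
    validated: bool           # did the run end with a validated finding?
    cost_usd: float           # total API spend for the run
    ttfv_s: Optional[float]   # time value behind TTFV, in seconds; None if nothing validated


def summarize(runs: List[Run]) -> dict:
    """Aggregate a group of runs into the three headline metrics."""
    n_valid = sum(r.validated for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    times = [r.ttfv_s for r in runs if r.validated and r.ttfv_s is not None]
    return {
        "validated_rate": n_valid / len(runs) if runs else 0.0,
        # Assumed definition: all spend (including failed runs) per validated finding.
        "cost_per_finding": total_cost / n_valid if n_valid else float("inf"),
        "median_ttfv_s": median(times) if times else None,
    }
```

Under this assumed definition, failed runs still add to the numerator, so an architecture that spends heavily on runs that never validate is penalized even when its detection rate is high.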

Observability Regimes and Domain Effects

The most pronounced performance deltas occur across observability and domain axes:

Whitebox access yields a 34.3-point uplift in validated detection (67.0% vs. 32.7% for blackbox).
Web targets are markedly easier than binary (74.3% validated vs. 25.3%).
The hardest cases are binary-blackbox, where naming the correct vulnerability class does not reliably lead to a dynamically validated exploit.
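These marginal rates can be cross-checked against the overall 49.8% validated detection rate. The check assumes a balanced design, as implied by 5 architectures × 3 model families × 2 access modes × 20 targets = 600 runs, so that each access mode and each domain covers 300 runs. The counts below are inferred from the rounded percentages and are not reported in the source.

```latex
\begin{align*}
\text{whitebox} &: \tfrac{201}{300} \approx 67.0\%, \qquad \text{blackbox}: \tfrac{98}{300} \approx 32.7\% \\
\text{web} &: \tfrac{223}{300} \approx 74.3\%, \qquad \text{binary}: \tfrac{76}{300} \approx 25.3\% \\
\text{overall} &: \tfrac{201 + 98}{600} = \tfrac{223 + 76}{600} = \tfrac{299}{600} \approx 49.8\%
\end{align*}
```

Both splits recover the same total of 299 validated runs, so the reported marginals agree with the headline figure under this assumption.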

All five architectures show clear regime shifts, but topologies that achieve the frontier in whitebox mode remain statistically close. Paired bootstrap intervals reinforce that MAS-Indep's gain is significantly larger in blackbox conditions.
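A paired, target-level bootstrap of the kind referred to above can be sketched as follows. This is a generic percentile bootstrap, not the paper's confirmed procedure: the function name, the resample count, and the choice of percentile intervals are assumptions, and any deltas passed in would be per-target differences in validated detection between two architectures.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    deltas: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of paired per-target deltas.

    Each delta is (rate of architecture A) - (rate of architecture B) on the
    same target, so resampling targets keeps the pairing intact.
    """
    if not deltas:
        raise ValueError("need at least one paired delta")
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

An interval that excludes zero suggests a consistent per-target advantage for one architecture. An interval that straddles zero matches the "statistically close" description given for the leading whitebox topologies.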

Cost and Operational Overhead

Wider coordination broadens search but incurs substantial token and context duplication. Output-token-related costs dominate; whitebox exposure increases input-side spend. The SAS baseline retains speed and cost advantage, making it attractive for efficiency-centric deployments, while MAS-Indep offers maximal coverage for tasks demanding high recall.

Post-hoc Analyses and Adaptive Routing

The empirical matrix reveals no universally dominant architecture. The main practical implication is to reframe architecture selection from a static design choice into a routing problem: the optimal topology depends on task features (surface entropy, exploit-chain depth, tool intensity, volatility) and operational constraints. Mixed-effects models separate architecture effects from regime-specific effects, which opens the way to learned adaptive routing policies. Such policies could maximize validated recall at a fixed budget by selecting architectures dynamically from early-run signals and task metadata.
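To make the routing framing concrete, here is a hand-written rule-based router standing in for the learned policy the paper envisions. It is a sketch only: the `Task` fields, the `mas_budget_floor` threshold, and the rules themselves are hypothetical, loosely motivated by the reported pattern that SAS is cheapest while MAS-Indep offers the highest coverage.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task descriptor (fields chosen for illustration)."""
    access: str            # "whitebox" or "blackbox"
    domain: str            # "web" or "binary"
    budget_usd: float      # spend allowed for this audit
    recall_critical: bool  # True when a missed finding is costly


def route(task: Task, mas_budget_floor: float = 0.15) -> str:
    """Pick an architecture name from task metadata.

    mas_budget_floor is an assumed threshold, not a value from the paper.
    """
    # Below the floor, the cheaper single-agent baseline is the only option.
    if task.budget_usd < mas_budget_floor:
        return "SAS"
    # Assumption: wider search pays off mainly in high-uncertainty regimes.
    uncertain = task.access == "blackbox" or task.domain == "binary"
    if task.recall_critical or uncertain:
        return "MAS-Indep"
    return "SAS"
```

A learned policy would replace these fixed rules with a model fitted to benchmark outcomes, and could also use early-run signals that this static sketch ignores.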

Practical and Theoretical Implications

This controlled benchmark clarifies that multi-agent coordination is not inherently superior: its value emerges selectively under increased uncertainty (blackbox, binary) and larger exploratory surfaces. SAS remains a strong baseline for low-cost or latency-sensitive use, while broader topologies such as MAS-Indep are justified in high-stakes or recall-driven scenarios. The findings challenge the assumption that more agents always yield better performance and point instead to context-aware, cost-conditioned architecture selection.

The benchmark establishes a replicable, artifact-rich foundation for auditing agentic architecture decisions. It enables future work on:

Automatic topology selection via learned policies,
Extension to larger, more heterogeneous real-world security workloads,
Integration of mixed domain and observability signals in adaptive agent orchestration.

Conclusion

The paper empirically studies agent coordination topologies for offensive security tasks, showing that coordination structure affects validated detection and operational cost, although observability and domain are the dominant effects. The main result is the characterization of a non-monotonic cost-quality frontier rather than a single winning architecture. Architecture choice should be viewed as a routing problem contingent on task-specific conditions, with the trade-offs exposed through controlled benchmarking. The methodology and findings provide actionable guidance for designing agentic systems targeting live, partially observable environments under budget constraints. Future work will likely explore learned, adaptive topology selection and broader benchmark coverage for production-scale security auditing.
