Red-Teaming the Agentic Red-Team

Published 23 Jun 2026 in cs.CR and cs.AI | (2606.24496v1)

Abstract: The use of agentic systems to perform offensive security operations has moved from a theoretical possibility to a commoditized capability. However, while the community has focused on creating more and more capable agents, less attention has been allocated to assessing the security of those systems. In this work, we present the first in-depth security analysis of the most widely used agentic systems for offensive security operations. We show that most of these tools share common design flaws that enable an active adversary to exfiltrate API keys, establish persistent footholds, and fully compromise the operator's machine, even when the agent operates inside a sandboxed container. To support our analysis, we introduce a full cyber kill chain for such agentic systems, capturing the progression from initial LLM manipulation to lateral movement, persistence, guardrail bypass, and sandbox escape. Building on our security analysis, we derive a robust architecture for agentic offensive-security tools and propose actionable, broadly applicable design principles that mitigate the disclosed attack paths at the architectural level.


Authors (4)

Summary

The paper presents a systematic security evaluation of agentic red-team tools, demonstrating near-deterministic exploitability (97.8%) across multiple LLMs.
It identifies critical architectural flaws such as unbounded privileges and guardrail failures that enable escalation, sandbox escape, and persistent compromise.
The study proposes a secure, least-privilege framework emphasizing strict worker-orchestrator separation and controlled communication to mitigate risks.

Security Analysis of Agentic Offensive-Red-Team Systems

Introduction

This paper systematically examines agentic offensive-security systems—LLM-driven autonomous agents used for penetration testing, adversary emulation, and red teaming. The work provides the first rigorous, empirical security evaluation of these tools, revealing that their architectures contain critical vulnerabilities that enable adversarial manipulation resulting in remote code execution, persistent agent weaponization, guardrail bypass, secrets exfiltration, and host system compromise. The analysis spans twelve leading open-source agentic-red-team platforms, employing a formalized kill chain to structure the threat assessment and demonstrating near-deterministic exploitability across multiple LLMs, including frontier models. The authors further propose a security-first architecture rooted in containment and least-privilege principles rather than faith in LLM prompt-hardening.

Threat Model and Architectural Abstraction

The paper frames its adversary as the controller of a target system subjected to agentic-red-team operations, seeking to exploit the agent's behavior to compromise the operator's infrastructure. The core architectural abstraction divides these tools into orchestrator (agentic harness + memory/skills), worker (execution/sandbox), and UI components. While sandboxing and guardrails are commonplace, systemic architectural flaws enable adversarial escalation regardless of defenses.

Kill Chain: Stepwise Vulnerability and Escalation Model

A bespoke kill chain models the standardized attack trajectory:

Worker RCE via Manipulation: Attacker leverages honeypot staging and reward hacking—rather than prompt injection—to induce the agent to download and execute crafted binaries containing self-planted vulnerabilities. This technique achieves a 97.8% success rate across ten agentic-red-teams and six frontier LLMs, including Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro. Failures predominantly stem from guardrails blocking the pentest operation itself, not payload refusal.
Privilege Escalation: Inadequate network/file-system isolation and unauthenticated intra-orchestration APIs enable lateral movement from worker to orchestrator, conferring elevated privileges and access to orchestration secrets.
Persistence: Attacker maintains foothold through manipulation of persistent code/configuration (hard persistence) or poisoning agentic memory/skills (soft persistence), allowing cross-session control even after worker resets.
Sandbox Escape: Over-privileged containers (e.g., --privileged, --network=host, docker.sock mounts) and misconfigured orchestrator APIs facilitate leap from sandboxed worker to host machine.
Host Compromise: Attacker achieves full remote code execution on operator's host, nullifying sandboxing and guardrails.

Intermediate objectives include persistent, unbounded agent weaponization (bypassing all guardrails) and secrets exfiltration (LLM API keys, cross-session logs), achieved in 10–11 out of 12 evaluated agents.
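
One way to keep per-tool findings organized along the kill chain above is to record which stages were demonstrated for each tool and report the deepest one. The following is a minimal sketch, not the paper's tooling: the `Stage` enum and `deepest_stage` helper are hypothetical names, the numbering simply mirrors the order in which this summary lists the stages, and the tool names and findings are invented for illustration.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Kill-chain stages, numbered in the order the summary lists them."""
    WORKER_RCE = 1
    PRIVILEGE_ESCALATION = 2
    PERSISTENCE = 3
    SANDBOX_ESCAPE = 4
    HOST_COMPROMISE = 5


def deepest_stage(observed):
    """Return the furthest stage in `observed`, or None if the set is empty."""
    return max(observed) if observed else None


# Illustrative findings only: tool names and results are invented.
findings = {
    "tool-a": {Stage.WORKER_RCE, Stage.PRIVILEGE_ESCALATION},
    "tool-b": {Stage.WORKER_RCE, Stage.SANDBOX_ESCAPE, Stage.HOST_COMPROMISE},
    "tool-c": set(),
}

for tool, stages in findings.items():
    top = deepest_stage(stages)
    print(tool, top.name if top else "no stage demonstrated")
```

The sketch only ranks stages by their listed order; it does not assume that every earlier stage is a prerequisite for a later one.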

Contradictory Claims and Numerical Results

The authors show that manipulation without prompt injection bypasses model-level defenses and classic guardrails, contradicting the assumption that frontier LLMs and prompt hardening suffice against adversarial manipulation. Enforcement must happen at the OS level rather than only at the orchestrator level; optimism about LLM reliability is refuted by near-deterministic exploitation in practice.

The empirical pipeline yields the following numerical highlights:

97.8% overall exploit success rate against frontier LLMs and agentic harnesses
Host compromise achieved in 8/12 systems; sandbox escape in 10/12
Secrets exfiltration and persistent weaponization achievable in 10–11/12 agents
No guardrail mechanism tested prevented actual exploitation

Root Causes and Architectural Pathology

Key root causes identified:

Guardrails not enforced at worker/OS level, allowing unbounded execution if worker is compromised.
Flat container architectures enabling leakage of orchestrator/host secrets.
Excessive privileges granted to containers for operational convenience, voiding isolation (e.g., mounting docker.sock, granting NET_RAW).
Orchestrator APIs exposed on the host network without authentication, enabling escalation from compromised worker.
Persistent memory and skills insufficiently protected against poisoning.

Contrary to popular belief, memory corruption and reward hacking maneuvers reliably induce RCE, unmitigated by prompt-inspection, static/dynamic analysis, or LLM system prompts—even in current frontier models.
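
The over-privilege root causes listed above can be checked mechanically before a worker container is launched. The sketch below is an assumption-laden illustration rather than anything from the paper: `audit_docker_args` is a hypothetical helper that scans a `docker run` argument list for the specific flags this summary names (`--privileged`, `--network=host`, `docker.sock` mounts, and the `NET_RAW` capability). It is not a complete container-hardening audit.

```python
def audit_docker_args(args):
    """Return findings for risky flags in a `docker run` argument list."""
    # Flags whose value may be passed as a separate argument ("--flag value").
    takes_value = {"--network", "--net", "--cap-add", "-v", "--volume", "--mount"}

    # Normalize "--flag value" into "--flag=value" so both spellings match.
    normalized = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg in takes_value and i + 1 < len(args):
            normalized.append(f"{arg}={args[i + 1]}")
            i += 2
        else:
            normalized.append(arg)
            i += 1

    findings = []
    for arg in normalized:
        flag, _, value = arg.partition("=")
        if flag == "--privileged" and value != "false":
            findings.append("--privileged: all capabilities and host device access")
        elif flag in {"--network", "--net"} and value == "host":
            findings.append("host networking: container shares the host network namespace")
        elif flag == "--cap-add" and value.upper() in {"NET_RAW", "CAP_NET_RAW"}:
            findings.append("NET_RAW capability granted")
        elif flag in {"-v", "--volume", "--mount"} and "docker.sock" in value:
            findings.append("docker.sock mounted into the container")
    return findings


# Example: a worker launched with every risky option named in the summary.
risky = ["--privileged", "--network=host",
         "-v", "/var/run/docker.sock:/var/run/docker.sock",
         "--cap-add", "NET_RAW", "worker-image"]
for finding in audit_docker_args(risky):
    print(finding)
```

A check like this is only useful if it runs outside the worker's influence, for example in the launcher that starts the container, since a compromised worker could otherwise skip or tamper with it.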

Secure Architecture and Design Principles

In response, the authors propose a robust blueprint grounded in realistic adversarial assumptions:

Strict Worker-Orchestrator Separation: Ephemeral, least-privileged containers for worker; all secrets and sensitive computation isolated to orchestrator in distinct network/OS segments.
Minimized Communication Channels: Controlled APIs for task-specific operations, command-execution separated from artifact transfer, explicit human gating for potentially dangerous integrations.
Deep Guardrails/Containment: OS-level egress proxy controlling outbound worker network traffic, application-level enforcement outside worker influence; all worker outputs treated as potentially malicious.
Least-Privileged Scoped Workers: Discrete containers for privileged tools (e.g., port scanners), exposed only through narrowly-scoped APIs, preventing arbitrary code execution in privileged environments.
Mitigation of Persistence Vectors: Minimize inter-session memory sharing; rigorous approval and inspection for injected skills/artifacts.
Assume Eventual LLM Compromise: Design for blast-radius containment rather than model hardening; treat worker as untrusted/malicious post-compromise.
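
The "narrowly-scoped API" idea from the principles above can be illustrated with a privileged port-scanner container that accepts only a structured request, never a raw command string. This is a minimal sketch under stated assumptions: `ScanRequest`, `validate_scan_request`, the `MAX_PORTS` cap, and the request field names are all hypothetical and do not come from the paper. Only the standard-library `ipaddress` module is used.

```python
import ipaddress
from dataclasses import dataclass

MAX_PORTS = 1024  # assumed cap on ports per request, chosen for illustration


@dataclass(frozen=True)
class ScanRequest:
    target: str
    ports: tuple[int, ...]


def validate_scan_request(raw, scope):
    """Validate an untrusted request; raise ValueError on anything unexpected.

    `scope` is a list of CIDR strings describing the engagement's allowed targets.
    """
    if not isinstance(raw, dict) or set(raw) != {"target", "ports"}:
        raise ValueError("request must contain exactly 'target' and 'ports'")

    if not isinstance(raw["target"], str):
        raise ValueError("target must be a string")
    addr = ipaddress.ip_address(raw["target"])  # raises ValueError if malformed
    if not any(addr in ipaddress.ip_network(net) for net in scope):
        raise ValueError("target is outside the engagement scope")

    ports = raw["ports"]
    if not isinstance(ports, list) or not 0 < len(ports) <= MAX_PORTS:
        raise ValueError("ports must be a non-empty list within the size cap")
    # type(p) is int rejects bools, which are otherwise a subclass of int.
    if any(type(p) is not int or not 1 <= p <= 65535 for p in ports):
        raise ValueError("ports must be integers in 1..65535")

    return ScanRequest(str(addr), tuple(sorted(set(ports))))


# Example: an in-scope request is normalized; a smuggled field is rejected.
print(validate_scan_request({"target": "10.0.0.5", "ports": [443, 80, 80]},
                            ["10.0.0.0/24"]))
try:
    validate_scan_request({"target": "10.0.0.5", "ports": [80], "cmd": "sh"},
                          ["10.0.0.0/24"])
except ValueError as exc:
    print("rejected:", exc)
```

Because the privileged container would build the scanner invocation itself from the validated fields, a compromised worker can at most request scans of in-scope addresses rather than run arbitrary code with elevated privileges.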

Practical and Theoretical Implications

Practically, agentic offensive-security tools, even with advanced LLMs and current state-of-the-art defenses, remain vulnerable to principled adversaries, necessitating systemic architectural overhaul before field deployment. Architectures must assume LLM misbehavior as baseline rather than exception.

Theoretically, agentic systems illustrate a category of AI-integrated automation whose security is fundamentally limited by the action space and privilege level granted to the agent. Model-level safety techniques are insufficient, and adversarial robustness requires deep system-level defenses.

Future developments will require hybrid approaches combining containerized blast-radius containment, continuous OS-level enforcement, and explicit human oversight at high-impact operational boundaries. Memory poisoning and soft persistence remain substantial open problems, and automated oracles for safety assessment are still largely unsolved.

Conclusion

The paper establishes that the security of agentic offensive-security systems is critically undermined by architectural and privilege design flaws, enabling near-deterministic exploitation even with the latest LLMs and accompanying defenses. The only practical mitigation is a system architecture built around explicit containment and minimal privilege, assuming eventual worker compromise. These findings set a rigorous foundation for future secure AI/agent orchestration in security operations and, by extension, broader autonomous system deployment.
