Agentic Red Teaming

Updated 17 March 2026

Agentic red teaming is a paradigm that tests interactive AI systems by treating each action sequence as an attack surface for context-dependent vulnerability discovery.
It utilizes comprehensive observability frameworks and iterative, context-aware adversarial strategies to expose flaws that are not detectable through static testing.
The approach informs AI safety by distinguishing between model-level and agentic-only vulnerabilities, guiding the development of robust defense and anomaly detection techniques.

Agentic red teaming is an advanced paradigm for systematically probing, stress-testing, and ultimately revealing the vulnerabilities of AI systems, particularly LLMs embedded in interactive, tool-calling, or multi-agent execution loops. Unlike classical red teaming—where adversaries target stand-alone models in static prompt-response settings—agentic red teaming interrogates deployed agentic systems as interactive state machines, treating each sequence of actions (tool use, memory manipulation, agent hand-off) as a potential attack surface. This approach recognizes the expanded and context-dependent attack surface introduced by agentic architectures, and necessitates new methodologies and observability frameworks to faithfully characterize, exploit, and defend against emergent agentic-only vulnerabilities.

1. Conceptual Foundations of Agentic Red Teaming

Agentic red teaming redefines the notion of adversarial evaluation in line with the evolution of LLMs from isolated text-generation engines to embedded controllers within complex, context-rich agentic systems. At its core, the methodology distinguishes between two tiers:

Model-level red teaming: The LLM is treated as a static function $f : X \rightarrow Y$, probed with crafted adversarial inputs $x \in X$ detached from the deployment environment.
Agentic-level red teaming: The deployed system is viewed as an interactive state machine comprising discrete “actions” (LLM calls, tool invocations, memory operations, agent transfer), unfolding in the context of full execution traces, tool outputs, and evolving memory—an irreducible agentic trajectory inaccessible to purely model-centric attacks (Wicaksono et al., 21 Sep 2025).

This shift is motivated by empirical evidence showing a divergence in vulnerability profiles: certain exploits only trigger in agentic contexts (agentic-only vulnerabilities), while some model-only attacks are neutralized within agentic execution loops, as mediating components may intercept or attenuate adversarial prompts.
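The two tiers can be contrasted in a minimal sketch. Everything here is an illustrative assumption: `toy_model`, `Action`, `AgentState`, and `agent_step` are hypothetical names, and the toy tool simply runs without consulting the model's refusal logic, which is one simple way an agentic-only vulnerability could arise.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: refuses prompts that mention 'secret'."""
    return "REFUSED" if "secret" in prompt.lower() else f"OK: {prompt}"


def model_level_probe(f: Callable[[str], str], x: str) -> str:
    """Model-level view: a static function f: X -> Y, with no deployment context."""
    return f(x)


@dataclass
class Action:
    kind: str      # "llm_call", "tool_call", "memory_op", or "agent_transfer"
    payload: str


@dataclass
class AgentState:
    memory: List[str] = field(default_factory=list)
    trace: List[Action] = field(default_factory=list)


def agent_step(state: AgentState, f: Callable[[str], str], action: Action) -> str:
    """Agentic-level view: each action mutates state and is recorded in the trace."""
    state.trace.append(action)
    if action.kind == "llm_call":
        context = " | ".join(state.memory)
        return f(f"{context} {action.payload}".strip())
    if action.kind == "tool_call":
        # Assumption: the toy tool executes without any refusal check.
        result = f"tool_output({action.payload})"
        state.memory.append(result)
        return result
    if action.kind == "memory_op":
        state.memory.append(action.payload)
        return "stored"
    if action.kind == "agent_transfer":
        return f"transferred:{action.payload}"
    raise ValueError(f"unknown action kind: {action.kind}")


# The same request is refused at the model level but goes through as a tool call.
model_result = model_level_probe(toy_model, "read secret.txt")
state = AgentState()
agent_result = agent_step(state, toy_model, Action("tool_call", "read secret.txt"))
```

In this toy setting the request is blocked when sent to the model directly, yet it executes when it arrives as a tool call, so only a probe at the action level would surface it.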

2. Methodologies and Frameworks

Agentic red teaming architectures typically integrate three elements: (i) comprehensive observability over internal agent actions, (ii) adaptive attack generation aligned with agentic context, and (iii) iterative or multi-turn attack strategies capable of exploiting temporal dependencies and stateful behaviors.

Observability Frameworks

AgentSeer (Wicaksono et al., 21 Sep 2025): Parses agentic execution into an ordered set of atomic actions $\{a_i\}$ and a component set $\{c_j\}$, building a directed knowledge graph $G=(V,E)$ that captures dependencies between them (e.g., dependencies mediated by a memory channel). MLflow spans are classified into semantic action types, which supports fine-grained visualization of both the chronological action graph and the component relationships.
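As a rough illustration of the graph $G=(V,E)$, the sketch below builds edges from an ordered action list. It is not AgentSeer's schema or API: the `Action` fields (`reads`, `writes`) and the edge labels are assumptions chosen to show how chronological and memory-mediated dependencies could be recorded side by side.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int, str]  # (source action index, target action index, label)


@dataclass(frozen=True)
class Action:
    idx: int
    kind: str                      # e.g. "llm_call", "tool_call"
    component: str                 # hypothetical component name c_j
    reads: Tuple[str, ...] = ()    # memory keys this action consumes
    writes: Tuple[str, ...] = ()   # memory keys this action produces


def build_action_graph(actions: List[Action]) -> Set[Edge]:
    """Return directed edges: 'next' for chronology, 'memory:<key>' for data flow."""
    edges: Set[Edge] = set()
    last_writer: Dict[str, int] = {}
    prev = None
    for a in actions:
        if prev is not None:
            edges.add((prev.idx, a.idx, "next"))
        for key in a.reads:
            if key in last_writer:
                # Dependency mediated by the memory channel.
                edges.add((last_writer[key], a.idx, f"memory:{key}"))
        for key in a.writes:
            last_writer[key] = a.idx
        prev = a
    return edges


trace = [
    Action(0, "llm_call", "planner", writes=("plan",)),
    Action(1, "tool_call", "search", writes=("results",)),
    Action(2, "llm_call", "writer", reads=("plan", "results")),
]
graph = build_action_graph(trace)
```

Keeping the memory edges separate from the chronological ones makes it possible to ask which earlier actions could have influenced a given injection point, not just which ones preceded it.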

Agentic Attack Protocols

Action Injection: Red teamers inject adversarial prompts at each agentic action point $a_i$, conditioning not only on current prompts but also on context $C_i$ (conversation history, tool results, memory states) (Wicaksono et al., 21 Sep 2025, Syros et al., 9 Feb 2026).
Iterative Context-Aware Attacks: Prompt refinement progresses over multiple rounds, where each refinement $p_o^{(t+1)} = g(p_o^{(t)}, f(C_i \oplus p_o^{(t)}))$ leverages full context $C_i$, outperforming model-only iterative attacks (Wicaksono et al., 21 Sep 2025).
Automated Evolutionary Search: Systems like AgenticRed (Yuan et al., 20 Jan 2026) treat red-teaming system design itself as a search problem—iteratively evolving multi-agent workflows (proposer-verifier, feedback loops) via LLM-driven meta-agents.
Adaptive Orchestration: Frameworks such as AJAR (Dou et al., 16 Jan 2026) employ protocol-driven architecture (MCP servers, Petri runtime, Auditor Agent) to coordinate multi-turn, tool-simulating attacks with stateful backtracking.
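The refinement rule $p_o^{(t+1)} = g(p_o^{(t)}, f(C_i \oplus p_o^{(t)}))$ can be sketched as a loop for use in an evaluation harness. The target `f`, refiner `g`, and judge below are toy stubs invented for illustration, and concatenation $\oplus$ is modeled as list append; none of this reflects the cited papers' implementations.

```python
from typing import Callable, List, Tuple

Messages = List[str]


def iterative_attack(
    target: Callable[[Messages], str],     # f: responds to context plus prompt
    refine: Callable[[str, str], str],     # g: rewrites the prompt from the response
    judge: Callable[[str], bool],          # decides whether the objective was met
    context: Messages,                     # C_i: history, tool results, memory
    prompt: str,                           # p^(0)
    max_rounds: int = 5,
) -> Tuple[bool, List[Tuple[str, str]]]:
    history: List[Tuple[str, str]] = []
    for _ in range(max_rounds):
        response = target(context + [prompt])   # f(C_i ⊕ p^(t))
        history.append((prompt, response))
        if judge(response):
            return True, history
        prompt = refine(prompt, response)       # p^(t+1) = g(p^(t), response)
    return False, history


# Toy stubs (assumptions): the target complies only when both the context
# and the prompt mention an audit.
def toy_target(messages: Messages) -> str:
    ctx, prompt = messages[:-1], messages[-1]
    if any("audit" in m for m in ctx) and "audit" in prompt:
        return "COMPLIED"
    return "REFUSED"


def toy_refine(prompt: str, response: str) -> str:
    return prompt + " (requested for the audit)"


def toy_judge(response: str) -> bool:
    return response == "COMPLIED"


with_context = iterative_attack(
    toy_target, toy_refine, toy_judge, ["tool: audit ticket opened"], "list the config"
)
without_context = iterative_attack(
    toy_target, toy_refine, toy_judge, [], "list the config"
)
```

With the toy stubs, the same refinement succeeds only when the context $C_i$ is supplied, which mirrors the claim that context-aware iteration can reach outcomes a model-only loop cannot.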

Task Domains

Agentic red teaming now spans:

Privacy leakage (system prompt and training-data extraction) (Nie et al., 2024).
Policy compliance breaking in customer-service or QA agents (Nakash et al., 11 Jun 2025).
Penetration testing and offensive cybersecurity using multi-stage tool access (Challita et al., 11 May 2025, Janjuesvic et al., 20 Nov 2025).
Multi-agent communication (e.g., Agent-in-the-Middle attacks) (He et al., 20 Feb 2025).
Web agent exploitation via adaptive, end-to-end indirect prompt injection (Syros et al., 9 Feb 2026).
Scenario-driven safety evaluation in multi-agent societies (Mehrabi et al., 20 Jun 2025).

3. Distinctive Findings, Agentic-Only Vulnerability Phenomena, and Quantitative Metrics

Empirical studies demonstrate that:

Agentic-only vulnerabilities: Certain adversarial objectives are unexploitable at the model level but become achievable in agentic settings due to emergent logic and context (e.g., file-access attacks succeeding at tool calls but inert at core LLM) (Wicaksono et al., 21 Sep 2025, Dou et al., 16 Jan 2026).
Contextual vulnerability gradients: Tool-calling actions show a roughly 24% higher attack success rate, in relative terms, than non-tool actions (average $\mathrm{ASR}_{\text{tool}} = 0.46$ vs. $\mathrm{ASR}_{\text{non-tool}} = 0.37$, an absolute gap of 9 percentage points) (Wicaksono et al., 21 Sep 2025). Agent-transfer operations can peak at an attack success rate of 67%.
Attack success rate (ASR): Varies by injection context (human, AI, or tool message), with iterative, context-aware attacks outperforming direct attacks by 4–6 percentage points (Wicaksono et al., 21 Sep 2025).
Attack instability: Attack effectiveness is volatile under reinjection, with a 50–80% drop when prompts are recycled, suggesting potential for context shuffling as a defense (Wicaksono et al., 21 Sep 2025).
Conversely, model-only vulnerabilities are not reliably reproduced in agentic loops, highlighting non-trivial dissociation between model and agentic attack surfaces.
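The difference between the relative and absolute gaps in the tool vs. non-tool comparison is easy to check numerically. The sketch below uses only the two averages quoted above (0.46 and 0.37); the helper names are illustrative.

```python
from typing import Sequence


def asr(outcomes: Sequence[bool]) -> float:
    """Attack success rate: fraction of attempts judged successful."""
    if not outcomes:
        raise ValueError("no attack attempts recorded")
    return sum(outcomes) / len(outcomes)


def relative_uplift(a: float, b: float) -> float:
    """Relative increase of a over b, e.g. 0.25 means 25% higher."""
    return (a - b) / b


asr_tool, asr_non_tool = 0.46, 0.37               # averages quoted in the text
absolute_gap = asr_tool - asr_non_tool            # in ASR units (0.09 = 9 points)
relative_gap = relative_uplift(asr_tool, asr_non_tool)

print(f"absolute gap: {absolute_gap * 100:.0f} percentage points")
print(f"relative gap: {relative_gap * 100:.0f}%")
```

The 24% figure is therefore a relative increase; the absolute difference between the two averages is 9 percentage points.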

Attack success rates by injection strategy, for direct and iterative attacks:

Injection Strategy	Direct ASR	Iterative ASR	Δ (Iterative − Direct, percentage points)
Human Message	57%	61%	+4
AI Message	42%	48%	+6
Tool Message	40%	46%	+6

4. Architecture: System Components and Attack Pipelines

Agentic red-teaming frameworks are typically modular, separating concerns such as observability, attack generation, and environment or tool simulation into distinct components:

AJAR (Dou et al., 16 Jan 2026): Implements a cognitive Auditor Agent and MCP service servers (exposing multi-turn jailbreak algorithms, e.g., X-Teaming), all within a Petri-based agentic runtime emulating tools and environment state. Stateful backtracking and tool simulation support complex multi-turn, environment-aware exploitation pipelines.
AgentSeer (Wicaksono et al., 21 Sep 2025): Provides logging, action-type classification, and action graph visualization linking agentic events to component and memory dependencies, enabling precise surface identification.
RedTeamLLM (Challita et al., 11 May 2025): Utilizes a three-stage LLM-driven “OS” structure for penetration testing—summarization, chain-of-thought reasoning, and act—coupled to recursive planning, plan correction, and memory tracing.
MUZZLE (Syros et al., 9 Feb 2026): Adapts to web agents, replaying agent trajectories to identify salient injection surfaces, then automatically synthesizing and refining contextual payloads for end-to-end, cross-application prompt injection attacks.
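Stateful backtracking, mentioned above for multi-turn pipelines, can be illustrated as a depth-first search that checkpoints the conversation before each attempt and discards the checkpoint on refusal. This is a generic sketch under assumed names (`Session`, `search`, `toy_respond`), not AJAR's or Petri's actual mechanism.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Session:
    turns: List[str] = field(default_factory=list)


def search(
    session: Session,
    plan: List[List[str]],                    # candidate messages for each turn
    respond: Callable[[List[str]], str],      # target under test
    is_refusal: Callable[[str], bool],
    depth: int = 0,
) -> Optional[Session]:
    """Return a session in which every turn was accepted, or None if none exists."""
    if depth == len(plan):
        return session
    for candidate in plan[depth]:
        snapshot = copy.deepcopy(session)     # checkpoint; the original is untouched
        snapshot.turns.append(candidate)
        reply = respond(snapshot.turns)
        if is_refusal(reply):
            continue                          # backtrack: drop the snapshot
        snapshot.turns.append(reply)
        result = search(snapshot, plan, respond, is_refusal, depth + 1)
        if result is not None:
            return result
    return None


# Toy target (assumption): refuses any message containing "blunt".
def toy_respond(turns: List[str]) -> str:
    last = turns[-1]
    return "REFUSE" if "blunt" in last else f"ack:{last}"


start = Session()
plan = [["blunt ask", "polite ask"], ["blunt follow-up", "polite follow-up"]]
found = search(start, plan, toy_respond, lambda r: r == "REFUSE")
```

Because each attempt works on a copy, a refused turn leaves no residue in the conversation state that later attempts build on.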

5. Comparative Analysis and Emergent Security Gaps

Agentic red teaming not only uncovers vulnerabilities missed by classical one-shot or single-turn methodologies, but also demonstrates that integrated tool use, environmental feedback, and multi-agent interaction open qualitatively new exploit pathways:

Communication-based compromise: Embedding an adversarial agent ("Agent-in-the-Middle") in LLM-based multi-agent systems allows systemic manipulation undetectable by agents lacking authenticated, integrity-checked messaging (He et al., 20 Feb 2025).
Indirect prompt injection: Adaptive attack frameworks such as MUZZLE reveal that reply fields and user-generated content in web UIs constitute high-exploitability surfaces, enabling attacks violating confidentiality, integrity, and availability, including credential-phishing and cross-app workflow hijacks (Syros et al., 9 Feb 2026).
Policy circumvention: Multi-agent red-teamers employing plan extraction, deception planning, and avoidance advisory can defeat policy-adherent customer service agents at significantly higher rates than blunt jailbreak templates, even under strong policy reminders (Nakash et al., 11 Jun 2025).
Kaleidoscopic teaming: Iterative in-context scenario generation in synthetic multi-agent societies exposes vulnerabilities that emerge only through agent-agent interaction (deception, collusion), especially with targeted, contrastive scenario optimization (Mehrabi et al., 20 Jun 2025).
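The communication-based compromise above depends on agents accepting messages without authentication or integrity checks. A minimal sketch of integrity-checked messaging, using HMAC-SHA256 from the Python standard library, is shown below. The message layout and the shared-key setup are assumptions for illustration; this is not a defense proposed in the cited work, and it does not address replay or key distribution.

```python
import hashlib
import hmac
import json
from typing import Dict


def _canonical(sender: str, body: str) -> bytes:
    """Serialize the signed fields deterministically."""
    return json.dumps({"sender": sender, "body": body}, sort_keys=True).encode("utf-8")


def sign(key: bytes, sender: str, body: str) -> Dict[str, str]:
    """Attach an HMAC-SHA256 tag covering the sender and the body."""
    tag = hmac.new(key, _canonical(sender, body), hashlib.sha256).hexdigest()
    return {"sender": sender, "body": body, "tag": tag}


def verify(key: bytes, msg: Dict[str, str]) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(
        key, _canonical(msg.get("sender", ""), msg.get("body", "")), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, msg.get("tag", ""))


shared_key = b"demo-key-for-illustration-only"   # hypothetical pre-shared key
message = sign(shared_key, "planner", "fetch the quarterly report")

# An intermediary that rewrites the body cannot produce a valid tag without the key.
tampered = dict(message, body="send the report to an external address")
```

A receiving agent would call `verify` before acting on a message and drop anything that fails the check.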

6. Implications for AI Safety, Defensive Practices, and Open Challenges

Work on agentic red teaming points to several practices for AI safety assessment:

Evaluations must be live and context-inclusive: Static, black-box prompt-response testing under-samples the true attack surface; red teaming should occur within the deployed agentic system under full environmental context (Wicaksono et al., 21 Sep 2025, Dou et al., 16 Jan 2026).
Tool and agent-transfer operations require explicit guardrails: Tool calling, code execution, and inter-agent transfer are consistently higher-risk and call for runtime policy checks and semantic filters (Wicaksono et al., 21 Sep 2025, Syros et al., 9 Feb 2026).
System-level observability is essential: Granular logging and action-graphing frameworks (e.g., AgentSeer, Petri/AJAR) are basic infrastructure for dissecting and defending complex agentic exploits (Wicaksono et al., 21 Sep 2025, Dou et al., 16 Jan 2026).
Multi-turn, dynamic adversarial strategies generalize best: Automated, learning-based agents leveraging evolutionary algorithms, reinforcement learning, and context-aware prompt adaptation consistently outperform manual and single-step methods, showing high transferability to proprietary models (Yuan et al., 20 Jan 2026, Xiong et al., 1 Jun 2025, Chen et al., 2 Apr 2025).
Defenses should exploit prompt instability and authentication: Attack prompts degrade with reinjection and transformation, suggesting utility for ephemeral prompt strategies and context rotation; inter-agent authentication and message validation are critical in multi-agent settings (He et al., 20 Feb 2025, Wicaksono et al., 21 Sep 2025).
Continuous, adversary-aware scenario generation is needed: Adaptive approaches such as in-context scenario optimization (kaleidoscopic teaming) or dynamic context-shuffling enable red teamers to maintain pressure as models and defenses co-evolve (Mehrabi et al., 20 Jun 2025).
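A runtime policy check on tool calls, as recommended above, might look like the following sketch. The allowlist, the sandbox root, and the `ToolCall` shape are hypothetical; a real deployment would derive them from its own tool registry and would likely add semantic filtering on top.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Dict, Tuple

ALLOWED_TOOLS = {"read_file", "search"}      # hypothetical allowlist
SANDBOX_ROOT = PurePosixPath("/workspace")   # hypothetical sandbox directory


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: Dict[str, str] = field(default_factory=dict)


def check_tool_call(call: ToolCall) -> Tuple[bool, str]:
    """Return (allowed, reason); intended to run before the tool is invoked."""
    if call.name not in ALLOWED_TOOLS:
        return False, "tool not allowlisted"
    if call.name == "read_file":
        path = PurePosixPath(call.args.get("path", ""))
        if ".." in path.parts:
            return False, "path traversal component"
        if SANDBOX_ROOT not in path.parents:
            return False, "path outside sandbox"
    return True, "ok"


decisions = {
    "inside": check_tool_call(ToolCall("read_file", {"path": "/workspace/notes.txt"})),
    "traversal": check_tool_call(ToolCall("read_file", {"path": "/workspace/../etc/passwd"})),
    "outside": check_tool_call(ToolCall("read_file", {"path": "/etc/passwd"})),
    "unknown_tool": check_tool_call(ToolCall("shell", {"cmd": "ls"})),
}
```

Placing the check between the model's tool request and the tool's execution means it applies regardless of which message (human, AI, or tool) carried the injected instruction.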

Open research challenges include: framework-agnostic observability decoupled from incumbent libraries, principled co-evolution of defensive and adversarial agents, efficient multi-objective optimization over attack success rate and query cost, and formal verification of policy adherence in dialogic, multi-party agentic contexts.
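The multi-objective trade-off between attack success rate and query cost can be framed as finding a Pareto front: the attack configurations for which no alternative is at least as good on both objectives and strictly better on one. The sketch below uses made-up configurations and numbers purely to show the selection rule.

```python
from typing import List, NamedTuple


class Attack(NamedTuple):
    name: str
    asr: float      # higher is better
    queries: int    # lower is better


def dominates(a: Attack, b: Attack) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.asr >= b.asr and a.queries <= b.queries
    strictly_better = a.asr > b.asr or a.queries < b.queries
    return no_worse and strictly_better


def pareto_front(attacks: List[Attack]) -> List[Attack]:
    """Keep the attacks that no other attack dominates, preserving input order."""
    return [a for a in attacks if not any(dominates(b, a) for b in attacks)]


# Hypothetical configurations; the numbers are invented for illustration.
candidates = [
    Attack("direct", asr=0.40, queries=10),
    Attack("iterative", asr=0.46, queries=50),
    Attack("iterative-long", asr=0.45, queries=60),
    Attack("template", asr=0.30, queries=10),
]
front = pareto_front(candidates)
```

Reporting the front rather than a single best attack makes the cost of each additional point of ASR explicit.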

7. Future Directions and Broader Impact

Agentic red teaming is an active area of AI vulnerability discovery, with ongoing work emphasizing:

More computationally efficient agentic search and meta-optimization (Yuan et al., 20 Jan 2026).
Broader domain generality beyond current benchmarks (e.g., from sales analytics to critical infrastructure).
Defensive AI systems (blue-team LLMs and automated anomaly detectors) architected to mirror and counter agentic red teamers (Janjuesvic et al., 20 Nov 2025, Nakash et al., 11 Jun 2025).
Modular, extensible frameworks enabling plug-and-play adversarial logic, real-time scenario generation, and cross-system attack surface modeling (Dou et al., 16 Jan 2026, Xiong et al., 1 Jun 2025).
Rigorous, scenario-diverse evaluations through agent-society simulation (kaleidoscopic frameworks) and integration into proactive safety compliance pipelines (Mehrabi et al., 20 Jun 2025).

The agentic red teaming paradigm both expands attack coverage and motivates system-level precautions in AI deployment, such as action-level observability, runtime guardrails on tool and agent-transfer operations, and authenticated inter-agent messaging.

Key citations:

"Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B" (Wicaksono et al., 21 Sep 2025)
"AJAR: Adaptive Jailbreak Architecture for Red-teaming" (Dou et al., 16 Jan 2026)
"Hiding in the AI Traffic: Abusing MCP for LLM-Powered Agentic Red Teaming" (Janjuesvic et al., 20 Nov 2025)
"PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage" (Nie et al., 2024)
"Red-Teaming LLM Multi-Agent Systems via Communication Attacks" (He et al., 20 Feb 2025)
"MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks" (Syros et al., 9 Feb 2026)
"RedTeamLLM: an Agentic AI framework for offensive security" (Challita et al., 11 May 2025)
"AgenticRed: Optimizing Agentic Systems for Automated Red-teaming" (Yuan et al., 20 Jan 2026)
"Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning" (Chen et al., 2 Apr 2025)
"Kaleidoscopic Teaming in Multi Agent Simulations" (Mehrabi et al., 20 Jun 2025)
"Effective Red-Teaming of Policy-Adherent Agents" (Nakash et al., 11 Jun 2025)
"Automatic LLM Red Teaming" (Belaire et al., 6 Aug 2025)
"CoP: Agentic Red-teaming for LLMs using Composition of Principles" (Xiong et al., 1 Jun 2025)