Agentic Red Teaming

Updated 17 March 2026
  • Agentic red teaming is a paradigm that tests interactive AI systems by treating each action sequence as an attack surface for context-dependent vulnerability discovery.
  • It utilizes comprehensive observability frameworks and iterative, context-aware adversarial strategies to expose flaws that are not detectable through static testing.
  • The approach informs AI safety by distinguishing between model-level and agentic-only vulnerabilities, guiding the development of robust defense and anomaly detection techniques.

Agentic red teaming is an advanced paradigm for systematically probing, stress-testing, and ultimately revealing the vulnerabilities of AI systems, particularly LLMs embedded in interactive, tool-calling, or multi-agent execution loops. Unlike classical red teaming—where adversaries target stand-alone models in static prompt-response settings—agentic red teaming interrogates deployed agentic systems as interactive state machines, treating each sequence of actions (tool use, memory manipulation, agent hand-off) as a potential attack surface. This approach recognizes the expanded and context-dependent attack surface introduced by agentic architectures, and necessitates new methodologies and observability frameworks to faithfully characterize, exploit, and defend against emergent agentic-only vulnerabilities.

1. Conceptual Foundations of Agentic Red Teaming

Agentic red teaming redefines the notion of adversarial evaluation in line with the evolution of LLMs from isolated text-generation engines to embedded controllers within complex, context-rich agentic systems. At its core, the methodology distinguishes between two tiers:

  • Model-level red teaming: The LLM is treated as a static function $f : X \rightarrow Y$, probed with crafted adversarial inputs $x \in X$ detached from the deployment environment.
  • Agentic-level red teaming: The deployed system is viewed as an interactive state machine comprising discrete “actions” (LLM calls, tool invocations, memory operations, agent transfer), unfolding in the context of full execution traces, tool outputs, and evolving memory—an irreducible agentic trajectory inaccessible to purely model-centric attacks (Wicaksono et al., 21 Sep 2025).

This shift is motivated by empirical evidence showing a divergence in vulnerability profiles: certain exploits only trigger in agentic contexts (agentic-only vulnerabilities), while some model-only attacks are neutralized within agentic execution loops, as mediating components may intercept or attenuate adversarial prompts.
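The two tiers can be contrasted in a small data-structure sketch. This is an illustrative assumption, not code from any cited framework: the names `ActionType`, `Action`, `Trajectory`, `model_level_probe`, and `agentic_probe_points` are hypothetical, and the four action kinds simply mirror the ones listed above.

```python
from dataclasses import dataclass, field
from enum import Enum


class ActionType(Enum):
    # The four action kinds named in the text (labels are illustrative).
    LLM_CALL = "llm_call"
    TOOL_CALL = "tool_call"
    MEMORY_OP = "memory_op"
    AGENT_TRANSFER = "agent_transfer"


@dataclass
class Action:
    index: int
    kind: ActionType
    payload: str  # prompt, tool arguments, memory write, or hand-off message


@dataclass
class Trajectory:
    actions: list[Action] = field(default_factory=list)

    def context_before(self, i: int) -> list[Action]:
        """Everything the system produced before action i (its context)."""
        return self.actions[:i]


def model_level_probe(model, x: str) -> str:
    # Model-level tier: the LLM is a static function f: X -> Y, with no
    # deployment context attached to the input.
    return model(x)


def agentic_probe_points(traj: Trajectory) -> list[tuple[Action, list[Action]]]:
    # Agentic tier: every action is a candidate injection point, and each one
    # is paired with the trajectory prefix that precedes it.
    return [(a, traj.context_before(a.index)) for a in traj.actions]
```

The point of the sketch is that a model-level probe has a single input, while an agentic trajectory of n actions yields n probe points, each with a different accumulated context.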

2. Methodologies and Frameworks

Agentic red teaming architectures typically integrate three elements: (i) comprehensive observability over internal agent actions, (ii) adaptive attack generation aligned with agentic context, and (iii) iterative or multi-turn attack strategies capable of exploiting temporal dependencies and stateful behaviors.

Observability Frameworks

  • AgentSeer (Wicaksono et al., 21 Sep 2025): Parses agentic execution into an ordered set of atomic actions $\{a_i\}$ and a component set $\{c_j\}$, building a directed knowledge graph $G=(V,E)$ that captures dependencies (e.g., via memory channel mediation). MLflow spans are classified into semantic action types, facilitating fine-grained visualization of both chronological action graphs and component relationships.
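A minimal sketch of the action-graph idea follows, using only the standard library. It is not AgentSeer's implementation or API: the trace format, the edge labels `next` and `depends_on`, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical trace records: (action_id, action_type, component, dependencies).
# A dependency means the action consumed output or memory written by an earlier one.
trace = [
    ("a0", "llm_call", "planner", []),
    ("a1", "tool_call", "file_tool", ["a0"]),
    ("a2", "memory_op", "memory", ["a1"]),
    ("a3", "agent_transfer", "executor", ["a0", "a2"]),
]


def build_action_graph(trace):
    """Return (nodes, edges) for a directed graph G = (V, E) over actions."""
    nodes = {}
    edges = set()
    prev = None
    for action_id, action_type, component, deps in trace:
        nodes[action_id] = {"type": action_type, "component": component}
        if prev is not None:
            edges.add((prev, action_id, "next"))  # chronological order
        for dep in deps:
            edges.add((dep, action_id, "depends_on"))  # data or memory dependency
        prev = action_id
    return nodes, edges


def component_graph(nodes, edges):
    """Collapse action-level edges into component-to-component relationships."""
    out = defaultdict(set)
    for src, dst, _label in edges:
        c_src, c_dst = nodes[src]["component"], nodes[dst]["component"]
        if c_src != c_dst:
            out[c_src].add(c_dst)
    return dict(out)
```

Keeping the chronological edges separate from the dependency edges gives the two views described above: the ordered action graph and the component relationship graph derived from it.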

Agentic Attack Protocols

  • Action Injection: Red teamers inject adversarial prompts at each agentic action point $a_i$, conditioning not only on current prompts but also on context $C_i$ (conversation history, tool results, memory states) (Wicaksono et al., 21 Sep 2025, Syros et al., 9 Feb 2026).
  • Iterative Context-Aware Attacks: Prompt refinement progresses over multiple rounds, where each refinement $p_o^{(t+1)} = g(p_o^{(t)}, f(C_i \oplus p_o^{(t)}))$ leverages full context $C_i$, outperforming model-only iterative attacks (Wicaksono et al., 21 Sep 2025).
  • Automated Evolutionary Search: Systems like AgenticRed (Yuan et al., 20 Jan 2026) treat red-teaming system design itself as a search problem—iteratively evolving multi-agent workflows (proposer-verifier, feedback loops) via LLM-driven meta-agents.
  • Adaptive Orchestration: Frameworks such as AJAR (Dou et al., 16 Jan 2026) employ protocol-driven architecture (MCP servers, Petri runtime, Auditor Agent) to coordinate multi-turn, tool-simulating attacks with stateful backtracking.
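The iterative refinement rule in the list above, p^(t+1) = g(p^(t), f(C_i ⊕ p^(t))), can be written as a small evaluation loop. This is a sketch under stated assumptions: ⊕ is modeled as list concatenation, `f` stands for the target system, `g` for the refiner, and `judge` for a success classifier. All three are caller-supplied stubs, and none of the names come from the cited papers.

```python
from typing import Callable, Sequence


def iterative_attack(
    context: Sequence[str],
    p0: str,
    f: Callable[[Sequence[str]], str],
    g: Callable[[str, str], str],
    judge: Callable[[str], bool],
    max_rounds: int = 5,
) -> tuple[str, int, bool]:
    """Refine a candidate prompt against a target, conditioning on context C_i.

    Returns (final_prompt, rounds_used, success).
    """
    p = p0
    for t in range(max_rounds):
        response = f(list(context) + [p])  # f(C_i ⊕ p^(t)); ⊕ assumed to be concatenation
        if judge(response):
            return p, t, True
        p = g(p, response)  # p^(t+1) = g(p^(t), response)
    return p, max_rounds, False


# Toy stand-ins for a test harness: the "target" only accepts a candidate that
# has been refined twice, and the "refiner" just appends a marker.
def toy_f(messages: Sequence[str]) -> str:
    return "COMPLY" if messages[-1].count("+") >= 2 else "REFUSE"


def toy_g(p: str, response: str) -> str:
    return p + "+"


def toy_judge(response: str) -> bool:
    return response == "COMPLY"
```

Calling `iterative_attack(["history", "tool output"], "probe", toy_f, toy_g, toy_judge)` exercises the loop end to end. A model-only variant corresponds to passing an empty `context`, which is the comparison the text draws.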

Task Domains

Agentic red teaming now spans:

3. Distinctive Findings, Agentic-Only Vulnerability Phenomena, and Quantitative Metrics

Empirical studies demonstrate that:

  • Agentic-only vulnerabilities: Certain adversarial objectives are unexploitable at the model level but become achievable in agentic settings due to emergent logic and context (e.g., file-access attacks succeeding at tool calls but inert at core LLM) (Wicaksono et al., 21 Sep 2025, Dou et al., 16 Jan 2026).
  • Contextual vulnerability gradients: Tool-calling actions have a 24% higher attack success rate than non-tool actions in relative terms (average $\mathrm{ASR}_{\text{tool}} = 0.46$ vs. $\mathrm{ASR}_{\text{non-tool}} = 0.37$, a gap of 9 percentage points) (Wicaksono et al., 21 Sep 2025). Agent-transfer operations can peak at 67% vulnerability.
  • Attack success rate (ASR): Varies by injection context (human, AI, or tool message), with iterative, context-aware attacks outperforming direct attacks by 4–6 percentage points (Wicaksono et al., 21 Sep 2025).
  • Attack instability: Attack effectiveness is volatile under reinjection, with a 50–80% drop when prompts are recycled, suggesting potential for context shuffling as a defense (Wicaksono et al., 21 Sep 2025).
  • Conversely, model-only vulnerabilities are not reliably reproduced in agentic loops, highlighting non-trivial dissociation between model and agentic attack surfaces.
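As a worked check of the tool versus non-tool figures above, the snippet below computes attack success rate per group and the relative increase between two rates. The helper names and the result format (pairs of group label and success flag) are assumptions for illustration. Only the values 0.46 and 0.37 come from the text.

```python
from collections import defaultdict


def asr_by_group(results):
    """results: iterable of (group, success) pairs. Returns {group: success rate}."""
    wins, totals = defaultdict(int), defaultdict(int)
    for group, success in results:
        totals[group] += 1
        wins[group] += int(success)
    return {group: wins[group] / totals[group] for group in totals}


def relative_increase(a: float, b: float) -> float:
    """Relative increase of a over b, as a fraction of b."""
    return (a - b) / b


asr_tool, asr_non_tool = 0.46, 0.37  # averages quoted in the text
rel = relative_increase(asr_tool, asr_non_tool)  # about 0.243, i.e. roughly 24%
gap_pp = (asr_tool - asr_non_tool) * 100  # about 9 percentage points
```

The distinction matters when reading the figures: the 24% is a relative increase, while the absolute gap between the two rates is about 9 percentage points.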

Results are typically reported as per-strategy breakdowns of attack success rate, for example:

| Injection Strategy | Direct ASR | Iterative ASR | Δ (Iterative − Direct) |
|---|---|---|---|
| Human Message | 57% | 61% | +4% |
| AI Message | 42% | 48% | +6% |
| Tool Message | 40% | 46% | +6% |

4. Architecture: System Components and Attack Pipelines

Agentic red-teaming frameworks are typically modular, separating attack orchestration, environment and tool simulation, and observability into distinct components:

  • AJAR (Dou et al., 16 Jan 2026): Implements a cognitive Auditor Agent and MCP service servers (exposing multi-turn jailbreak algorithms, e.g., X-Teaming), all within a Petri-based agentic runtime emulating tools and environment state. Stateful backtracking and tool simulation support complex multi-turn, environment-aware exploitation pipelines.
  • AgentSeer (Wicaksono et al., 21 Sep 2025): Provides logging, action-type classification, and action graph visualization linking agentic events to component and memory dependencies, enabling precise surface identification.
  • RedTeamLLM (Challita et al., 11 May 2025): Utilizes a three-stage LLM-driven “OS” structure for penetration testing (summarization, chain-of-thought reasoning, and action), coupled to recursive planning, plan correction, and memory tracing.
  • MUZZLE (Syros et al., 9 Feb 2026): Adapts to web agents, replaying agent trajectories to identify salient injection surfaces, then automatically synthesizing and refining contextual payloads for end-to-end, cross-application prompt injection attacks.
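Stateful backtracking, mentioned for AJAR above, can be illustrated with a generic depth-first search that snapshots the environment state before each turn. This is a sketch of the general technique only. It does not reproduce AJAR or Petri internals, and every name in it (`backtracking_search`, `toy_step`, the `trust` counter) is hypothetical.

```python
import copy


def backtracking_search(state, candidates_fn, step_fn, is_goal, max_depth):
    """Depth-first multi-turn search with state snapshots.

    The state is deep-copied before each step, so a failed branch can be
    abandoned without affecting the parent state. Returns the list of moves
    that reaches the goal, or None if no path exists within max_depth.
    """
    if is_goal(state):
        return []
    if max_depth == 0:
        return None
    for move in candidates_fn(state):
        snapshot = copy.deepcopy(state)
        next_state = step_fn(snapshot, move)
        if next_state is None:  # dead end (e.g., a refusal): backtrack
            continue
        rest = backtracking_search(next_state, candidates_fn, step_fn, is_goal, max_depth - 1)
        if rest is not None:
            return [move] + rest
    return None


# Toy environment: move "b" only works after move "a" has been played once.
def toy_step(state, move):
    if move == "b" and state["trust"] < 1:
        return None
    state["trust"] += 1 if move == "a" else 2
    state["turns"].append(move)
    return state


start = {"turns": [], "trust": 0}
path = backtracking_search(
    start,
    candidates_fn=lambda s: ["b", "a"],
    step_fn=toy_step,
    is_goal=lambda s: s["trust"] >= 3,
    max_depth=3,
)
```

In the toy run the search first tries "b", hits a dead end, restores the earlier state, and then finds the two-turn path. The starting state is left unmodified because each branch works on its own copy.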

5. Comparative Analysis and Emergent Security Gaps

Agentic red teaming not only uncovers vulnerabilities missed by classical one-shot or single-turn methodologies, but also demonstrates that integrated tool use, environmental feedback, and multi-agent interaction open qualitatively new exploit pathways:

  • Communication-based compromise: Embedding an adversarial agent ("Agent-in-the-Middle") in LLM-based multi-agent systems allows systemic manipulation undetectable by agents lacking authenticated, integrity-checked messaging (He et al., 20 Feb 2025).
  • Indirect prompt injection: Adaptive attack frameworks such as MUZZLE reveal that reply fields and user-generated content in web UIs constitute high-exploitability surfaces, enabling attacks violating confidentiality, integrity, and availability, including credential-phishing and cross-app workflow hijacks (Syros et al., 9 Feb 2026).
  • Policy circumvention: Multi-agent red-teamers employing plan extraction, deception planning, and avoidance advisory can defeat policy-adherent customer service agents at significantly higher rates than blunt jailbreak templates, even under strong policy reminders (Nakash et al., 11 Jun 2025).
  • Kaleidoscopic teaming: Iterative in-context scenario generation in synthetic multi-agent societies exposes vulnerabilities that emerge only through agent-agent interaction (deception, collusion), especially with targeted, contrastive scenario optimization (Mehrabi et al., 20 Jun 2025).
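The communication-based compromise above depends on agents accepting messages without authentication or integrity checks. One common mitigation is to attach a keyed message authentication code to every inter-agent message. The sketch below uses Python's standard `hmac` module. The message schema, the function names, and the shared-key setup are assumptions for illustration, not a design taken from the cited work.

```python
import hashlib
import hmac
import json


def _canonical(fields: dict) -> bytes:
    # Deterministic serialization so sender and receiver hash identical bytes.
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode()


def sign_message(key: bytes, sender: str, recipient: str, seq: int, body: str) -> dict:
    """Build a message dict carrying an HMAC-SHA256 tag over all other fields."""
    msg = {"sender": sender, "recipient": recipient, "seq": seq, "body": body}
    msg["tag"] = hmac.new(key, _canonical(msg), hashlib.sha256).hexdigest()
    return msg


def verify_message(key: bytes, msg: dict) -> bool:
    """Recompute the tag and compare in constant time."""
    fields = {k: v for k, v in msg.items() if k != "tag"}
    expected = hmac.new(key, _canonical(fields), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, msg.get("tag", ""))
```

A message altered in transit by an intermediary that lacks the key fails `verify_message`. The sketch does not cover key distribution or replay. A receiver would also need to track the `seq` field to reject repeated messages.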

6. Implications for AI Safety, Defensive Practices, and Open Challenges

Agentic red teaming has driven several paradigm shifts and established new best practices in AI safety assessment:

  • Evaluations must be live and context-inclusive: Static, black-box prompt-response testing under-samples the true attack surface; red teaming should occur within the deployed agentic system under full environmental context (Wicaksono et al., 21 Sep 2025, Dou et al., 16 Jan 2026).
  • Tool and agent-transfer operations require explicit guardrails: Tool-calling, code execution, and inter-agent transfer are consistently higher-risk, so they require runtime policy checks and semantic filters (Wicaksono et al., 21 Sep 2025, Syros et al., 9 Feb 2026).
  • System-level observability is essential: Granular logging and action-graphing frameworks (e.g., AgentSeer, Petri/AJAR) are basic infrastructure for dissecting and defending complex agentic exploits (Wicaksono et al., 21 Sep 2025, Dou et al., 16 Jan 2026).
  • Multi-turn, dynamic adversarial strategies generalize best: Automated, learning-based agents leveraging evolutionary algorithms, reinforcement learning, and context-aware prompt adaptation consistently outperform manual and single-step methods, showing high transferability to proprietary models (Yuan et al., 20 Jan 2026, Xiong et al., 1 Jun 2025, Chen et al., 2 Apr 2025).
  • Defenses should exploit prompt instability and authentication: Attack prompts degrade with reinjection and transformation, suggesting utility for ephemeral prompt strategies and context rotation; inter-agent authentication and message validation are critical in multi-agent settings (He et al., 20 Feb 2025, Wicaksono et al., 21 Sep 2025).
  • Continuous, adversary-aware scenario generation is needed: Adaptive approaches such as in-context scenario optimization (kaleidoscopic teaming) or dynamic context-shuffling enable red teamers to maintain pressure as models and defenses co-evolve (Mehrabi et al., 20 Jun 2025).
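A runtime policy check for tool calls, one of the guardrails recommended above, might look like the following. The allowlist, the `/workspace` root, and the `ToolCall` shape are hypothetical choices for the sketch. A real deployment would derive them from its own tool registry and policy.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict


# Hypothetical policy: which tools may run, and which directory may be read.
ALLOWED_TOOLS = {"read_file", "search"}
ALLOWED_ROOT = PurePosixPath("/workspace")


def check_tool_call(call: ToolCall) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed tool call, before it executes."""
    if call.tool not in ALLOWED_TOOLS:
        return False, f"tool '{call.tool}' is not allowlisted"
    if call.tool == "read_file":
        path = PurePosixPath(call.args.get("path", ""))
        # Reject parent-directory traversal and anything outside the allowed root.
        if ".." in path.parts or not path.is_relative_to(ALLOWED_ROOT):
            return False, f"path '{path}' is outside {ALLOWED_ROOT}"
    return True, "ok"
```

The check runs on the structured call rather than on the prompt text, so it still applies when an injected instruction reaches the model through a tool result or memory entry. `PurePath.is_relative_to` requires Python 3.9 or later.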

Open research challenges include: framework-agnostic observability decoupled from incumbent libraries, principled co-evolution of defensive and adversarial agents, efficient multi-objective optimization over attack success rate and query cost, and formal verification of policy adherence in dialogic, multi-party agentic contexts.

7. Future Directions and Broader Impact

Agentic red teaming defines the emerging state-of-the-art in AI vulnerability discovery, with ongoing work emphasizing:

The agentic red teaming paradigm both expands attack coverage and compels a new class of formal, system-level precautions in AI deployment—solidifying its foundational role in safe, robust, and trustworthy next-generation AI systems.

