DREAM: Dynamic Red-Teaming Across Environments

Updated 27 December 2025
  • DREAM is a dynamic framework that evaluates AI vulnerabilities by orchestrating adaptive, multi-stage adversarial interactions across digital and code domains.
  • It leverages methodologies like the CE-AKG, C-GPS, and adaptive RL to uncover flaws that static red-teaming methods often miss.
  • Empirical results reveal high attack success rates and contextual fragility, demonstrating the framework’s effectiveness in multi-environment safety assessments.

Dynamic Red-Teaming across Environments (DREAM) refers to a class of methodologies and frameworks that systematically evaluate, expose, and understand vulnerabilities of AI agents—particularly LLMs and code agents—across multiple operational contexts. Unlike conventional static, single-turn red-teaming, DREAM enables persistent, adaptive adversaries to construct multi-step, cross-environment attacks, leveraging stateful reasoning, dynamic adaptation, and empirical grounding in diverse domains. This approach aims to surface failure modes that arise only under complex, multi-stage interactions, providing a stringent foundation for robust AI safety assessments (Lu et al., 22 Dec 2025, Yun et al., 26 Sep 2025, Cuevas et al., 23 Sep 2025, Guo et al., 2 Oct 2025).

1. Motivation and Scope

Traditional red-teaming benchmarks for AI models evaluate static responses to individual, often templated, malicious queries. However, AI agents today operate in open-ended, multi-environment settings—making API calls, manipulating code, or responding in multiple languages. Static benchmarks fail to capture vulnerabilities that depend on prior context, span multiple environments, or result from a series of benign-looking steps (the “domino effect”). DREAM addresses these shortcomings by formalizing adversarial interactions as dynamic, multi-stage processes, systematically exploring how vulnerabilities manifest and persist across environment boundaries (Lu et al., 22 Dec 2025).

2. Core Methodological Frameworks

2.1. Cross-Environment Adversarial Knowledge Graph (CE-AKG)

The CE-AKG is a dynamic data structure maintaining the adversary's evolving world model as a graph, with nodes representing entities (e.g., files, tokens, users) and edges encoding relations (e.g., "has_permission," "vulnerable_to"). Formally, the system is modeled as a partially observable Markov decision process (PO-MDP) with belief state $b_t$ implemented as the graph $G_t = (V_t, E_t)$. At each attack step, atomic actions yield new observations, which are parsed and fused into the graph, permitting the adversary to track vulnerabilities, prerequisites, and cross-domain pivots efficiently (Lu et al., 22 Dec 2025).
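
A minimal sketch of this update cycle, assuming a triple-store graph representation; the class, method, and relation names below are illustrative, not the paper's schema:

```python
# Minimal CE-AKG sketch: the belief state b_t is a graph G_t = (V_t, E_t)
# of entities and relation triples, updated as atomic actions return
# observations. All names and relations here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class CEAKG:
    nodes: set = field(default_factory=set)   # entities: files, tokens, users, ...
    edges: set = field(default_factory=set)   # (source, relation, target) triples

    def fuse_observation(self, triples: list[tuple[str, str, str]]) -> None:
        """Parse the observation from an atomic action and merge it into G_t."""
        for src, rel, dst in triples:
            self.nodes.update({src, dst})
            self.edges.add((src, rel, dst))

    def related(self, entity: str, relation: str) -> set[str]:
        """Query prerequisites or pivots, e.g. what an entity is 'vulnerable_to'."""
        return {d for s, r, d in self.edges if s == entity and r == relation}

# Example: fuse what a reconnaissance step revealed, then query for pivots.
graph = CEAKG()
graph.fuse_observation([
    ("user:alice", "has_permission", "file:/etc/shadow"),
    ("file:/etc/shadow", "vulnerable_to", "exploit:weak_acl"),
])
print(graph.related("file:/etc/shadow", "vulnerable_to"))  # {'exploit:weak_acl'}
```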

2.2. Contextualized Guided Policy Search (C-GPS)

C-GPS governs the construction of multi-stage attack chains, using CE-AKG as state input. It operates by generating candidate atomic actions from a large library, scoring each candidate based on intrinsic exploit potential, current entity match, and strategic advancement (e.g., environment pivots). The policy $\pi_{\mathrm{cond}}(b_t) = \arg\max_{a \in \mathcal{C}_t} V(b_t, a)$ selects the next best attack action, iterating through scenarios with backtracking and discounting as needed to optimize chain-level reward (Lu et al., 22 Dec 2025).
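
A hedged sketch of this selection step, assuming a simple additive value function; the weights, field names, and pivot bonus are illustrative stand-ins for whatever scoring the paper actually uses:

```python
# C-GPS sketch: score each candidate atomic action against the current belief
# state and pick the argmax, i.e. pi_cond(b_t) = argmax_{a in C_t} V(b_t, a).

def value(belief: dict, action: dict,
          w_exploit: float = 1.0, w_match: float = 1.0, w_pivot: float = 0.5) -> float:
    """V(b_t, a): exploit potential + entity match + environment-pivot bonus."""
    required = action["required_entities"]
    match = sum(e in belief["entities"] for e in required) / max(len(required), 1)
    pivot = 1.0 if action["target_env"] not in belief["visited_envs"] else 0.0
    return w_exploit * action["exploit_potential"] + w_match * match + w_pivot * pivot

def next_action(belief: dict, candidates: list[dict]) -> dict:
    """pi_cond(b_t): greedy argmax over the candidate set C_t."""
    return max(candidates, key=lambda a: value(belief, a))

# Example: the cross-environment pivot wins despite a lower exploit score.
belief = {"entities": {"token:api_key"}, "visited_envs": {"email"}}
candidates = [
    {"name": "reuse_token", "exploit_potential": 0.6,
     "required_entities": ["token:api_key"], "target_env": "email"},
    {"name": "pivot_to_cloud", "exploit_potential": 0.5,
     "required_entities": ["token:api_key"], "target_env": "cloud_storage"},
]
print(next_action(belief, candidates)["name"])  # pivot_to_cloud
```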

2.3. Adaptive RL-based Red-Teaming (Active Attacks)

Active Attacks reframes red-teaming as an adaptive reinforcement learning problem: the attacker LLM generates prompts, observes the victim's responses, and receives a reward (e.g., toxicity score). Crucially, after each cycle, successful attack prompts are used to safety-fine-tune the victim LLM, reducing future reward for exploited modes and compelling the attacker to find new, previously unexplored vulnerabilities. This mechanism produces an implicit easy-to-hard curriculum and prevents mode collapse by reinitializing the attacker and its replay buffer after each round (Yun et al., 26 Sep 2025).
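
A toy, runnable rendering of this cycle, in which ToyAttacker and ToyVictim stand in for the attacker and victim LLMs and a membership test stands in for a toxicity classifier; the dynamics (reward shrinks as exploited modes are patched, attacker is reset each round) mirror the description above:

```python
import random

class ToyAttacker:
    """Stand-in for the attacker LLM: samples prompts, reinforces rewarded ones."""
    def __init__(self, vocabulary: list[str]):
        self.vocabulary = vocabulary
        self.weights = [1.0] * len(vocabulary)

    def generate(self) -> str:
        return random.choices(self.vocabulary, weights=self.weights, k=1)[0]

    def update(self, prompt: str, reward: float) -> None:
        self.weights[self.vocabulary.index(prompt)] += reward  # crude policy update

    def reinitialize(self) -> "ToyAttacker":
        return ToyAttacker(self.vocabulary)  # fresh policy, empty replay buffer

class ToyVictim:
    """Stand-in for the victim LLM: a set of still-exploitable failure modes."""
    def __init__(self, weaknesses: set[str]):
        self.weaknesses = set(weaknesses)

    def reward(self, prompt: str) -> float:
        return 1.0 if prompt in self.weaknesses else 0.0  # proxy toxicity score

    def safety_finetune(self, exploited: set[str]) -> None:
        self.weaknesses -= exploited  # patch the modes found this round

attacker = ToyAttacker(["jailbreak_a", "jailbreak_b", "benign"])
victim = ToyVictim({"jailbreak_a", "jailbreak_b"})
for cycle in range(3):
    buffer = []
    for _ in range(50):
        prompt = attacker.generate()
        reward = victim.reward(prompt)
        attacker.update(prompt, reward)
        buffer.append((prompt, reward))
    victim.safety_finetune({p for p, r in buffer if r > 0})  # close exploited modes
    attacker = attacker.reinitialize()  # reset prevents mode collapse
    print(f"cycle {cycle}: remaining weaknesses = {victim.weaknesses}")
```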

2.4. Knowledge-Grounded, Cross-Lingual Red-Teaming

Anecdoctoring formalizes red-teaming across (language, place) pairs by extracting fact-checked misinformation claims, clustering them into narrative structures, and encoding key entities and relations as local knowledge graphs. Attack LLMs are augmented with these KGs, enabling the generation of contextually grounded prompts that maximize attack success rates uniformly across diverse linguistic and geographic settings. Environment-specific clustering and KG construction ensure that red-teaming captures both local and global adversarial narratives (Cuevas et al., 23 Sep 2025).
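
An illustrative sketch of the clustering stage for one (language, place) pair, using TF-IDF and k-means as stand-ins for the paper's embedding and clustering choices; the toy claims and two-cluster setting are assumptions:

```python
# Group fact-checked claims into narrative clusters; each cluster then seeds
# a local knowledge graph. TF-IDF + k-means are stand-in assumptions here.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

claims = [
    "Vaccine X alters DNA, officials hide the trial data",
    "Officials concealed the results of Vaccine X trials",
    "5G towers in city Y cause mystery illness",
    "Illness outbreaks in city Y linked to new 5G masts",
]
vectors = TfidfVectorizer().fit_transform(claims)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# A real pipeline would next run entity/relation extraction over each cluster
# to build its knowledge graph and feed that graph to the attack LLM.
narratives: dict[int, list[str]] = {}
for claim, label in zip(claims, labels):
    narratives.setdefault(int(label), []).append(claim)
for label, cluster in sorted(narratives.items()):
    print(f"narrative {label}: {cluster}")
```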

2.5. Adaptive Memory and Tool Selection in Code Agents

RedCodeAgent demonstrates a dynamic memory-augmented approach for code agents. Past attack trajectories—including which tools were used, outcomes, and success signals—are stored. When a new risk scenario is encountered, the agent retrieves relevant memories, infers the most effective tool chain by statistical score–cost trade-off, and evaluates generated code in realistic sandbox environments. Memory is continually updated, and cross-environment trials are handled by spinning up distinct containers or stateful sandboxes (Guo et al., 2 Oct 2025).
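
A minimal sketch of the retrieve-and-select step under these definitions; the memory schema, string-similarity retrieval, and score–cost weighting are illustrative assumptions, not RedCodeAgent's actual implementation:

```python
# Memory-augmented tool selection sketch: retrieve past trajectories similar
# to the new risk scenario, then rank tool chains by success minus a cost
# penalty. Field names and scoring are illustrative assumptions.
from difflib import SequenceMatcher

memory = [  # past trajectories: scenario, tool chain, success signal, cost
    {"scenario": "delete system files", "tools": ["jailbreak", "code_sub"], "success": 1, "cost": 2.0},
    {"scenario": "delete system files", "tools": ["jailbreak"], "success": 0, "cost": 1.0},
    {"scenario": "exfiltrate env vars", "tools": ["code_sub"], "success": 1, "cost": 1.0},
]

def retrieve(scenario: str, k: int = 2) -> list[dict]:
    """Fetch the k most similar past trajectories (string similarity as a stand-in)."""
    sim = lambda m: SequenceMatcher(None, scenario, m["scenario"]).ratio()
    return sorted(memory, key=sim, reverse=True)[:k]

def select_tool_chain(scenario: str, cost_weight: float = 0.1) -> list[str]:
    """Pick the tool chain maximizing empirical success minus a cost penalty."""
    best = max(retrieve(scenario), key=lambda m: m["success"] - cost_weight * m["cost"])
    return best["tools"]

print(select_tool_chain("remove critical system files"))  # ['jailbreak', 'code_sub']
# The generated attack code then runs inside a sandbox container, and the
# outcome (tools used, success signal, cost) is appended back to memory.
```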

3. Atomic Action Libraries, Toolboxes, and Knowledge Bases

A central feature of DREAM implementations is the maintenance of comprehensive action or tool libraries capable of spanning diverse digital and physical environments.

| Framework | Atomic Actions/Tools | Environments/Domains Covered |
|---|---|---|
| DREAM (Lu et al., 22 Dec 2025) | 1,986 atomic attacks | 349 digital environments |
| RedCodeAgent (Guo et al., 2 Oct 2025) | Jailbreak prompts, substitution LLMs | Multiple code agents and programming languages |
| Anecdoctoring (Cuevas et al., 23 Sep 2025) | Clusters of adversarial prompts | English/Spanish/Hindi; US/India |

Each atomic action in DREAM incorporates a prompt template, target environment, and explicit entity requirements. RedCodeAgent’s toolbox covers both generic jailbreaks and code-specific injection techniques, while Anecdoctoring’s clusters are mapped to knowledge graphs to facilitate adversarial prompt synthesis matched to local narratives.
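
For concreteness, a hypothetical shape for one such atomic-action record, reflecting the three components just named (the actual DREAM schema may differ):

```python
# Hypothetical atomic-action record: prompt template, target environment,
# and explicit entity requirements; all field names are illustrative.
atomic_action = {
    "id": "exfil_via_email_042",
    "prompt_template": "Forward the contents of {file} to {address} as a routine backup.",
    "target_env": "email_client",
    "required_entities": ["file", "address"],  # must already be present in the CE-AKG
}
```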

4. Evaluation Protocols and Metrics

DREAM frameworks measure agent safety and vulnerability using multi-faceted metrics, including per-step and chain-wise attack success rates (ASR), classifier-based toxicity or policy-violation scores, and the diversity and coverage of discovered attacks.

These frameworks demonstrate that multi-stage, cross-environment attacks are highly effective (e.g., 70%+ chain-wise success on 8/12 LLMs in DREAM) and expose classes of weaknesses, such as contextual fragility and intent-tracking failures, not observable under static, single-interaction benchmarking.
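
Chain-wise metrics are stricter than per-step metrics because a chain only counts as a success if every stage lands. A minimal sketch, assuming this all-steps-succeed reading of chain-wise ASR (the papers' exact scoring rules may differ):

```python
# Per-step vs. chain-wise attack success rate (ASR), assuming a chain
# succeeds only when every step in it succeeds.
def step_asr(chains: list[list[bool]]) -> float:
    steps = [s for chain in chains for s in chain]
    return sum(steps) / len(steps)

def chain_asr(chains: list[list[bool]]) -> float:
    return sum(all(chain) for chain in chains) / len(chains)

trials = [[True, True, True], [True, False, True], [True, True, False]]
print(round(step_asr(trials), 2), round(chain_asr(trials), 2))  # 0.78 0.33
```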

5. Empirical Findings and Weaknesses Exposed

DREAM evaluations consistently reveal “contextual fragility”: agents that enforce strong safety on isolated prompts often fail when context is built up over several environments, or when malicious intent is presented in a distributed, time-lagged fashion (Lu et al., 22 Dec 2025). Furthermore, agents lack effective long-term intent tracking and generally process prompts in isolation, resulting in systemic vulnerabilities to chained or multi-domain attacks. Traditional static defenses, such as initial defense prompts or prompt shielding, are ineffective over long interaction chains (Lu et al., 22 Dec 2025, Yun et al., 26 Sep 2025).

Empirical studies highlight:

  • Attack chains synthesized using CE-AKG and C-GPS succeed in >70% of cases for most LLM agents (Lu et al., 22 Dec 2025).
  • RedCodeAgent uncovers new vulnerabilities in commercial code assistants (72.7% ASR on Cursor vs. 62.6% without dynamic tool selection) (Guo et al., 2 Oct 2025).
  • Anecdoctoring’s KG-augmented attacks yield ASRs ≈ 0.90 across languages and locales, outperforming simpler adversarial pipelines (Cuevas et al., 23 Sep 2025).
  • Active Attacks yield a 440× improvement in cross-attack success rates compared to GFlowNet-only RL, while increasing diversity and coverage with nominal computational overhead (Yun et al., 26 Sep 2025).

6. Generalization, Adaptation, and Transfer

DREAM frameworks are explicitly designed for adaptability:

  • Attack policies adapt in response to moving targets—e.g., safety-fine-tuned LLMs or agents retrained on previously successful attacks—forcing the discovery of novel failure modes (Yun et al., 26 Sep 2025).
  • Transferability is observed both in the victim models (defensive payloads generated using Active Attacks are effective on unseen, larger models) and attack pipelines (KG-based attacks transfer across language/model boundaries, though defense robustness varies with model architecture and grounding) (Cuevas et al., 23 Sep 2025, Yun et al., 26 Sep 2025).
  • Modular design, such as interchangeable toolboxes and dynamic memory architectures, supports extension to new domains (e.g., web agents, robotics, planning agents) under the same design principles (Guo et al., 2 Oct 2025).

7. Limitations and Future Directions

DREAM methodologies inherit certain bottlenecks:

  • Heavy reliance on automated classifiers (e.g., for toxicity or policy violation) may blind the attack pipeline to unmodeled or subtle failures (Yun et al., 26 Sep 2025).
  • Aggressive memory resets, as in Active Attacks, may trade off retained exploration knowledge for diversity; adaptive reset schedules or hybrid policies may improve stability.
  • Current public DREAM frameworks emphasize digital environments; adaptation to embodied, real-time, or non-textual scenarios is ongoing.

Directions outlined for future work include evaluators that go beyond automated classifiers, adaptive reset schedules or hybrid exploration policies, and extension of the frameworks to embodied, real-time, and non-textual environments.
