Multi-turn Jailbreak Attacks: Strategies & Defenses
- Multi-turn jailbreak attacks are adversarial tactics that distribute a harmful objective across multiple dialogue exchanges to evade detection by LLM safety filters.
- They leverage techniques like contextual drift, foot-in-the-door compliance, and representation masking to incrementally guide models toward forbidden outputs.
- Recent automated frameworks such as ICON and HarmNet achieve high attack success rates, exposing critical vulnerabilities and driving new defense strategies.
Multi-turn jailbreak attacks are adversarial tactics designed to defeat the safety alignment of LLMs by distributing a harmful objective across multiple conversational exchanges. Rather than issuing a single overtly malicious prompt, the attacker incrementally establishes context, semantic drift, or misdirection, progressively steering the model toward outputting policy-violating or hazardous content. This attack paradigm exploits vulnerabilities in LLMs’ dialogue coherence, context fusion, and policy generalization: capabilities that hold up well against single-turn attacks but break down systematically under multi-turn threat models. Over the past two years, multi-turn jailbreaks have evolved from simple manual escalations to sophisticated, automated, and learning-based frameworks, challenging both the theoretical underpinnings and practical efficacy of current LLM safety defenses.
1. Formal Definitions and Core Principles
A multi-turn jailbreak comprises a conversational sequence of user prompts, where each appears benign in isolation, yet the full dialogue context sets the stage for a final adversarial request that the safety-aligned model would reject if presented directly (Sun et al., 2024, Kumarappan et al., 24 Nov 2025, Bullwinkel et al., 29 Jun 2025). The attack succeeds if the model’s final response contains the forbidden content while evading rejection by the model’s input filter.
Multi-turn attack strategies exploit several cognitive and architectural phenomena within modern LLMs:
- Contextual drift: subtle shifts in conversation history that undermine invariant safety (Hu et al., 28 Feb 2025).
- Foot-in-the-door (FITD) compliance: gradual progression from innocuous to harmful requests, increasing model acquiescence (Kumarappan et al., 24 Nov 2025).
- Representation masking: migration of hidden-state vectors toward the benign region, fooling detectors trained on single-shot harmful examples (Bullwinkel et al., 29 Jun 2025).
- Pattern exploitation: leveraging diverse conversational patterns to find model-specific and harm-category-specific blind spots (Nihal et al., 9 Oct 2025).
Formally, given an adversarial query sequence $q_{1:T}$ eliciting responses $r_{1:T}$, the attacker aims to maximize $\Pr\big[\mathcal{J}(r_T, G) = 1\big]$, where $G$ is the malicious target and the judge function $\mathcal{J}$ indicates semantic satisfaction of $G$ by the final response (Sun et al., 2024).
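To make this threat model concrete, the following is a minimal sketch of the evaluation harness the definition implies. The `target_model` and `judge` callables are hypothetical stand-ins for whatever model API and judge model a given benchmark uses, not any specific paper’s implementation:

```python
# Minimal sketch of the multi-turn objective formalized above. `target_model`
# and `judge` are hypothetical stand-ins: the judge returns 1 when the final
# response semantically satisfies the malicious target G, and 0 otherwise.

def attack_succeeds(prompts, target_model, judge, malicious_target):
    """Run the query sequence q_1..q_T and apply the judge to the final
    response r_T; success means r_T fulfills the target without a refusal."""
    history = []
    final_response = ""
    for prompt in prompts:                      # each q_t looks benign in isolation
        final_response = target_model(history=history, prompt=prompt)
        history.append((prompt, final_response))
    return judge(final_response, malicious_target) == 1
```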
2. Attack Methodologies and Core Frameworks
Recent literature introduces highly modular frameworks that automate and optimize multi-turn jailbreak attacks:
- ActorAttack/ActorBreaker orchestrates a chain of queries leveraging actor-network theory, dynamically linking benign-appearing actor-clues to the harmful objective. This approach outperforms all tested baselines on both success and diversity metrics (Ren et al., 2024).
- Crescendo and Echo Chamber encapsulate protocol-driven and gradual-escalation attacks respectively, focusing on stepwise context building and persuasive model elaboration, often through echoing prior outputs (Bullwinkel et al., 29 Jun 2025, Alobaid et al., 9 Jan 2026).
- NEXUS and HarmNet employ semantic network expansion (ThoughtNet), iterative simulation-based refinement, and adaptive traversal to discover high-probability attack chains (Asl et al., 3 Oct 2025, Narula et al., 21 Oct 2025). Chains are scored by cumulative harmfulness and semantic alignment, using judge models for feedback-driven optimization (a sketch of this judge-scored loop appears after this list).
- ICON formalizes Intent-Context Coupling, establishing that each malicious intent category (e.g., hacking, fraud) has an optimal context pattern (e.g., scientific research framing) that maximally reduces the safety filter’s activation. ICON exploits this via intent-driven routing and hierarchical optimization (Lin et al., 28 Jan 2026).
- Mastermind leverages hierarchical planning and a self-improving knowledge repository, dynamically recombining attack strategies through planning-execution-reflection loops for robust and transferable jailbreaks (Li et al., 9 Jan 2026).
- Siren demonstrates a pipeline combining supervised fine-tuning, direct preference optimization, and continuous self-improvement, with smaller attacker LLMs matching the performance of more powerful models at lower turn counts (Zhao et al., 24 Jan 2025).
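The frameworks above differ in how they generate candidate queries, but most share a judge-scored refinement loop of the kind attributed to NEXUS and HarmNet. The sketch below is an illustrative reconstruction of that loop under assumed `attacker_llm`, `target_llm`, and `judge` callables; it is not the actual implementation or API of any named framework:

```python
from dataclasses import dataclass

# Illustrative judge-scored chain refinement loop. All callables are
# hypothetical stand-ins; no named framework's API is reproduced here.

@dataclass
class Turn:
    prompt: str
    response: str
    score: float  # judge-assigned goal-alignment score in [0, 1]

def run_attack_chain(goal, attacker_llm, target_llm, judge, max_turns=6):
    """Iteratively extend a dialogue, letting a judge model score each response
    and feeding that feedback back to the attacker for the next query."""
    history: list[Turn] = []
    for _ in range(max_turns):
        prompt = attacker_llm(goal=goal, history=history)      # propose next query
        response = target_llm(history=history, prompt=prompt)   # query the target
        score = judge(goal=goal, response=response)             # score the response
        history.append(Turn(prompt, response, score))
        if score >= 0.9:          # judge deems the goal semantically satisfied
            return history, True
    return history, False
```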
3. Representational and Safety Analysis
Multi-turn jailbreaks manipulate the internal state of LLMs in ways that defeat defenses built for static, single-turn contexts:
- Representation Engineering: Single-turn classifiers and circuit-breaker defenses rely on fixed representations of harmfulness, but multi-turn chains (Crescendo-style) shift the final hidden states into the benign region, bypassing distance-based or MLP probes trained solely on direct attacks (Bullwinkel et al., 29 Jun 2025); see the probe sketch after this list. Closing this generalization gap requires extending representational sculpting to multi-turn sequences and penalizing representation drift.
- Psychological and Narrative Exploits: The FITD principle, as operationalized in large-scale automated benchmarks, manifests as a significant +32 percentage point attack success rate increase in GPT-family models when the full conversational history is available (Kumarappan et al., 24 Nov 2025). LLMs integrate conversational context into their refusal policy, a design that can be systematically undermined by priming, narrative framing, or self-elaboration (as in Echo Chamber’s persuasion cycle) (Alobaid et al., 9 Jan 2026).
- Pattern Generalization Failure: Empirical evidence indicates that robustness to one conversation pattern does not generalize to others. Fine-tuning an LLM against information-seeking attacks in disinformation scenarios reduces attack success only within that pattern; cross-pattern robustness gaps remain substantial (≥20–30 percentage points) (Nihal et al., 9 Oct 2025).
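The representation-level failure can be pictured with a simple linear probe. The sketch below assumes hidden states (e.g., last-layer activations at the final token of each turn) have already been extracted from the target model, and that `probe_w`/`probe_b` come from a probe trained only on single-turn harmful vs. benign prompts; it shows how such a probe is applied per turn and why a drifting chain may never cross the threshold:

```python
import numpy as np

# Minimal sketch: a linear "harmfulness" probe applied to per-turn hidden states.
# `probe_w` and `probe_b` are assumed to come from a probe fit only on
# single-turn examples, which is exactly the setting that fails to generalize.

def probe_score(hidden_state: np.ndarray, probe_w: np.ndarray, probe_b: float) -> float:
    """Sigmoid score in [0, 1]; higher means 'looks harmful' to the probe."""
    return 1.0 / (1.0 + np.exp(-(hidden_state @ probe_w + probe_b)))

def flags_chain(turn_states, probe_w, probe_b, threshold=0.5) -> bool:
    """Flag a dialogue if ANY turn's hidden state crosses the probe threshold.
    A single-turn-trained probe often fires on direct requests but stays below
    threshold on Crescendo-style chains whose states drift toward the benign region."""
    return any(probe_score(h, probe_w, probe_b) >= threshold for h in turn_states)
```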
4. Empirical Results and Comparative Performance
Multi-turn jailbreaks consistently surpass both single-turn and naïve multi-round baselines under rigorous benchmarking:
| Framework | Target model | ASR (selected) | Comparative notes |
|---|---|---|---|
| ActorAttack | GPT-4o | 79% (HarmBench) | Outperforms PAIR, Crescendo (Ren et al., 2024) |
| ICON | GPT-5.1 | 96.5% | Intent-context routing; best overall (Lin et al., 28 Jan 2026) |
| HarmNet/NEXUS | Mistral-7B | 99.4% | Modular network/semantic traversal (Asl et al., 3 Oct 2025, Narula et al., 21 Oct 2025) |
| Mastermind | GPT-5 | 60% (StrongReject) | Knowledge-driven, resilient (Li et al., 9 Jan 2026) |
| Siren | Gemini-1.5 | 90% | LLaMA-3-8B attacker on Gemini (Zhao et al., 24 Jan 2025) |
| Crescendo | LLaMA-3-8B | ~55% | Loses effectiveness as chain length increases (Bullwinkel et al., 29 Jun 2025) |
Key findings:
- ICON, NEXUS, HarmNet, and Mastermind demonstrate ASRs above 90% on state-of-the-art models, significantly surpassing single-turn or static heuristic multi-turn methods.
- Attacks such as Echo Chamber improve success rates by >50% over prior methods in complex categories (e.g., violence, hacking) (Alobaid et al., 9 Jan 2026).
- Automated multi-turn-to-single-turn (M2S) conversions (e.g., Hyphenize, Numberize, Pythonize; Ha et al., 6 Mar 2025) can yield even higher success rates than the original multi-turn dialogues due to “contextual blindness” in guardrails, as illustrated by the sketch below.
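As an illustration, the following is a minimal sketch of what a Numberize-style M2S conversion might look like, collapsing the user turns of a dialogue into one enumerated prompt; the actual Hyphenize/Numberize/Pythonize transformations in (Ha et al., 6 Mar 2025) may differ in detail:

```python
def numberize(user_turns):
    """Sketch of a Numberize-style M2S conversion: collapse a multi-turn
    dialogue's user prompts into a single enumerated single-turn prompt.
    (Illustrative only; the paper's transformations may differ.)"""
    lines = ["Please address each of the following points in order:"]
    lines += [f"{i}. {turn}" for i, turn in enumerate(user_turns, start=1)]
    return "\n".join(lines)
```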
5. Model Failure Modes and Defense Limitations
Empirical evaluations demonstrate that multi-turn jailbreaks expose critical limitations in current safety architectures:
- Architectural Divergence: Context-dependent safety architectures (GPT-4o, GPT-5 series) are much more susceptible to FITD and narrative attacks, with ASR jumps of up to 32 percentage points in multi-turn mode. Context-agnostic architectures (Gemini 2.5 Flash) demonstrate near-immunity by stripping conversational context and evaluating the final prompt in isolation (Kumarappan et al., 24 Nov 2025).
- Guardrail Asymmetry and Asymptotics: Defending against single-turn attacks does not guarantee defense against their multi-turn equivalents—a 50% attack structure asymmetry is observed even when prompt content is held constant (e.g., Claude-3-Opus) (Gibbs et al., 2024).
- Representation Generalization Gap: Circuit-breaker style defenses cannot generalize to multi-turn chains as the hidden states for these scenarios are pushed deeper into benign regions and never intersect with the “harmful” clusters seen at training (Bullwinkel et al., 29 Jun 2025).
- Efficiency Gaps and Exploits: Automated path-planning attacks using swarm intelligence (e.g., the Artificial Bee Colony search of Liu et al., 5 Nov 2025) and semantic routing (ICON) dramatically reduce the average number of queries required per successful attack (e.g., ICON reaches 73% ASR within 5 queries; ABC reaches 98% ASR at an average of 10.3 queries on GPT-3.5), improving adversarial efficiency over local search or random exploration baselines; the efficiency metrics themselves are sketched below.
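For clarity, the two efficiency figures quoted above (attack success rate and average queries per successful attack) can be computed from simple per-attempt logs. The sketch below assumes a hypothetical log of `(succeeded, n_queries)` pairs:

```python
def efficiency_metrics(attempts):
    """attempts: list of (succeeded: bool, n_queries: int) per attack attempt.
    Returns (ASR, average queries per successful attack or None)."""
    successes = [n for ok, n in attempts if ok]
    asr = len(successes) / len(attempts) if attempts else 0.0
    avg_queries = sum(successes) / len(successes) if successes else None
    return asr, avg_queries

# e.g., a log like [(True, 9), (True, 12), (False, 20), (True, 10), ...]
# yields figures of the form "98% ASR at ~10.3 queries" reported above.
```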
6. Defense Strategies and Emerging Mitigations
Defensive research recommends several architectural and algorithmic innovations:
- Context-Independent Filtering: Re-evaluate the final user prompt independently of history, as implemented by Gemini’s “pretext stripping”; block any input that violates policy when assessed on its own (Kumarappan et al., 24 Nov 2025). A combined sketch of this and the next mitigation appears after this list.
- Escalation and Anomaly Analytics: Monitor for progressive escalation in harmfulness, pattern adherence, or semantic drift; trigger stricter safety thresholds dynamically (Nihal et al., 9 Oct 2025).
- Representation-Aware Training: Expand fine-tuning objectives to cover multi-turn input distributions (e.g., multi-turn rerouting losses, dialogue-dynamics regularization) (Bullwinkel et al., 29 Jun 2025, Hu et al., 28 Feb 2025).
- Pattern-Aware and Structural Detectors: Use classifiers to identify dialogue patterns (e.g., hypothetical, technical, information-seeking) that are underrepresented in safety data and escalate scrutiny when high-risk patterns are detected (Nihal et al., 9 Oct 2025).
- Self-Reflexive Safety Engines: Algorithms such as BIID (Bidirectional Intention Inference Defense) combine forward intent inference with backward response retrospection, reducing ASR to near zero while maintaining utility (Tong et al., 25 Sep 2025).
- Graph-Structured Defenders: Multi-turn input graphs (e.g., G-Guard, an attention-aware GNN-based classifier) offer effective detection of cross-turn intent aggregation, vastly outperforming single-input rejection or static moderation layers (Huang et al., 9 Jul 2025).
- Automatic Data Curation and Adversarial Training: Incorporate multi-turn adversarial examples—particularly those generated by protocols such as ActorAttack, PE-CoA, or CFA—into the safety curriculum to better anticipate distributed threat vectors (Ren et al., 2024, Nihal et al., 9 Oct 2025, Sun et al., 2024).
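As a minimal sketch of the first two mitigations above (context-independent filtering and escalation analytics), assuming a generic, hypothetical `harm_score(text) -> float` moderation classifier rather than any particular vendor’s safety stack:

```python
# Minimal sketch combining pretext stripping with escalation monitoring.
# `harm_score` is a hypothetical stand-in for a deployment's moderation classifier.

def harm_score(text: str) -> float:
    """Hypothetical moderation classifier returning a harmfulness score in [0, 1]."""
    return 0.0  # placeholder; a real deployment would call its policy classifier

def should_block(user_turns, standalone_threshold=0.5, escalation_slope=0.1):
    # 1) Pretext stripping: score the final prompt with no conversational history.
    if harm_score(user_turns[-1]) >= standalone_threshold:
        return True
    # 2) Escalation analytics: flag dialogues whose per-turn harm scores trend upward.
    scores = [harm_score(turn) for turn in user_turns]
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    avg_slope = sum(deltas) / len(deltas) if deltas else 0.0
    return avg_slope >= escalation_slope
```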
7. Future Directions and Open Challenges
Multi-turn jailbreak research highlights persistent and emerging challenges for LLM safety:
- The combinatorial attack surface grows as context windows expand and models increase in capability.
- “Knowledge-driven” and “semantic-network-driven” frameworks, exemplified by Mastermind and PE-CoA, illustrate how adversaries can self-improve and generalize across model families, demanding defenses that operate at the strategy and pattern level.
- As multi-modal LLMs become standard, joint visual-textual multi-turn jailbreaks (e.g., attacks on MLLMs; Das et al., 8 Jan 2026) introduce additional dimensions to the adversarial space, necessitating fragment-level and ensemble-judge postprocessing.
- A fundamental implication is that robustness must be measured and optimized in both single-turn and multi-turn regimes; neglecting either gives a false sense of security (Gibbs et al., 2024, Yang et al., 11 Aug 2025).
Converging evidence demonstrates that principled, context-aware, and pattern-diverse multi-turn attacks remain dominant over naive defenses. Robust LLM safety alignment will therefore require advances in pattern-aware training, self-reflexive moderation, graph-based input fusion, and automated analysis of context semantics across dialogue trajectories.