Multi-Turn Jailbreak Attacks
- Multi-turn jailbreak attacks are adversarial strategies that use sequential dialogue turns to bypass safety protocols in large language models.
- They exploit context accumulation and intent obfuscation to gradually introduce harmful requests, achieving higher success rates than single-turn methods.
- Empirical research shows that these attacks expose critical vulnerabilities in current defenses, driving the development of multi-turn–aware monitoring and mitigation techniques.
Multi-turn jailbreak attacks are a class of adversarial strategies targeting large language models (LLMs) in which an attacker uses a sequence of conversational turns to bypass safety alignment and elicit otherwise restricted outputs. In contrast to single-turn jailbreaks, which depend on a single, often overt prompt to breach safety guardrails, multi-turn attacks unfold gradually, exploiting accumulated context, dialogue coherence, and alignment loopholes to subvert moderation mechanisms. Recent research has characterized and systematized these attack vectors, revealing critical threats to deployed LLMs and driving a new wave of defensive methods and benchmarks.
1. Formal Definitions and Structural Principles
A multi-turn jailbreak attack is defined as a sequence of prompts $(p_1, \dots, p_T)$ such that at each dialogue turn $t$, the LLM response $r_t$ is conditioned on the preceding interaction history $h_{t-1} = ((p_1, r_1), \dots, (p_{t-1}, r_{t-1}))$ and the adversarial prompt $p_t$: $$r_t \sim M(\cdot \mid h_{t-1}, p_t).$$ The attack is successful if the final response $r_T$ (or any intermediate $r_t$) satisfies a harmfulness predicate $J(r_t, g) \ge \tau$ for target intent $g$, where $J$ is a judge or classifier and $\tau$ a threshold. Unlike in single-turn attacks, no $p_t$ need signal intent to violate safety when read in isolation; instead, the malicious objective is fragmented across turns and contextual cues.
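The definition above can be read as an evaluation loop: each response is produced from the history plus the next prompt, and a judge score is compared against a threshold at every turn. The following is a minimal sketch of that loop for red-team evaluation, not an implementation from any cited framework. The names `run_dialogue`, `first_flagged_turn`, `toy_model`, and `toy_judge` are hypothetical, and the toy model and judge are placeholders that stand in for a real LLM and a real harmfulness classifier.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]                     # (prompt, response)
Model = Callable[[List[Turn], str], str]   # (history, prompt) -> response
Judge = Callable[[str, str], float]        # (response, intent) -> score in [0, 1]


def run_dialogue(model: Model, prompts: List[str]) -> List[Turn]:
    """Feed prompts one at a time; each response sees the full prior history."""
    history: List[Turn] = []
    for prompt in prompts:
        response = model(list(history), prompt)
        history.append((prompt, response))
    return history


def first_flagged_turn(
    history: List[Turn], judge: Judge, intent: str, tau: float
) -> Optional[int]:
    """Return the 1-based index of the first response scoring >= tau, else None."""
    for t, (_, response) in enumerate(history, start=1):
        if judge(response, intent) >= tau:
            return t
    return None


# Toy stand-ins (assumptions for illustration only): the "model" only reports
# how many turns it has seen, and the "judge" score grows with that count.
def toy_model(history: List[Turn], prompt: str) -> str:
    return f"reply-{len(history) + 1}"


def toy_judge(response: str, intent: str) -> float:
    turn = int(response.split("-")[1])
    return min(1.0, 0.25 * turn)


transcript = run_dialogue(toy_model, ["q1", "q2", "q3", "q4"])
print(first_flagged_turn(transcript, toy_judge, "placeholder-intent", tau=0.7))
```

Checking every turn rather than only the last one mirrors the "final or any intermediate response" clause of the definition; an evaluator that scores only the final response would pass `history[-1:]` instead.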
Key structural features include:
- Context accumulation: Attack success derives from the model’s tendency to preserve coherence and escalate helpfulness as context grows.
- Intent obfuscation: Each individual turn appears benign; the adversarial meaning is latent and emerges only from the conversation as a whole.
- Pattern and context exploitation: Attacks are often instantiated in specific context patterns (e.g., scientific research, hypothetical scenarios) that are congruent with the malicious intent, which relaxes the model's safety constraints (Lin et al., 28 Jan 2026).
- History-aware adversarial objectives: Many frameworks formalize the attack as a control policy over dialogue states, optimizing for harmfulness as a function of trajectory through a combinatorial prompt space (Li et al., 9 Jan 2026, Zhao et al., 24 Jan 2025).
2. Attack Methodologies and Instantiations
A range of methodologies for executing multi-turn jailbreak attacks has been established, with notable frameworks characterized by their planning, reasoning, or optimization mechanisms. Representative examples include:
- Knowledge-driven planning (Mastermind): Hierarchical decomposition of high-level attack strategies and low-level prompt execution, with dynamic closed-loop reflection and a knowledge repository of adaptable adversarial tactics (Li et al., 9 Jan 2026).
- Intent-context coupling (ICON): Prior-guided selection of semantically congruent context patterns for maximal relaxation of safety constraints, followed by hierarchical prompt optimization (local refinement and global context switching) (Lin et al., 28 Jan 2026).
- Reinforcement learning (SEMA): Open-loop policy optimization using intent-drift-aware reward functions that explicitly balance intent alignment, compliance risk, and instructional detail, supporting transferable attack strategies (Feng et al., 6 Feb 2026).
- Pattern-driven escalation (PE-CoA): Exploiting empirically validated conversational patterns (e.g., technical, hypothetical, personal narrative) that map to systematic model blind spots across harm categories (Nihal et al., 9 Oct 2025).
- Automated population search (ABC): Framing multi-turn jailbreak discovery as a path-planning problem in a dynamic weighted graph, solved via swarm-based (bee colony) metaheuristics for efficient trajectory optimization (Liu et al., 5 Nov 2025).
- Learning-based decomposition (Siren): Dataset construction using turn-level LLM feedback, followed by supervised and preference-optimized attackers that dynamically adapt over multi-turn dialogue, simulating real-world adversarial behavior (Zhao et al., 24 Jan 2025).
- Reasoning-augmented conversation: Reformulating harmful objectives as benign reasoning tasks, leveraging iterative LLM reasoning to penetrate safety alignment (Ying et al., 16 Feb 2025).
- Context fusion and keyword obfuscation (CFA): Masking malicious intent through keyword substitution and scenario-based context, integrating it in the later turns of a dialogue to evade surface-level toxicity filters (Sun et al., 2024).
- Echo Chamber Attack: Planting benign but semantically loaded fragments in early turns and repeatedly referencing ("echoing") them to incrementally amplify harmful content. This exploits completion and consistency biases, often through black-box interfaces (Alobaid et al., 9 Jan 2026).
3. Empirical Effectiveness and Model Vulnerabilities
Empirical studies demonstrate that multi-turn jailbreaks yield higher attack success rates (ASR) than single-turn methods, especially on models with state-of-the-art single-turn defenses. SEMA achieves an average ASR@1 of 80.1% across Qwen2.5-3B, Llama-3.1-8B, and GPT-4.1-mini, outperforming single-turn baselines by 33.9 percentage points (Feng et al., 6 Feb 2026). ICON reports 97.1% average ASR across eight leading LLMs, with rapid convergence using significantly fewer queries than prior iterative approaches (Lin et al., 28 Jan 2026). Mastermind attains substantial gains in both success rate (up to 67% on Claude 3.7 Sonnet, 60% on GPT-5) and harmfulness ratings over all prior art (Li et al., 9 Jan 2026).
Pattern-driven attacks (PE-CoA) reveal that success is closely tied to both harm category and conversational style, producing ASRs ranging from 36.67% (Claude-3-haiku, Information pattern) to 100% (Mistral-7B-Instruct, Personal pattern) (Nihal et al., 9 Oct 2025). Automated search (ABC) achieves >90% ASR across all tested LLMs with orders-of-magnitude reduction in red-teaming cost (Liu et al., 5 Nov 2025).
Quantitative analyses also show that current defenses, even context-aware ones, often generalize poorly: block rates differ substantially between single-turn and multi-turn forms of equivalent content (Gibbs et al., 2024). Some models (Gemini 2.5 Flash) approach immunity (0.10% ASR in the multi-turn setting), while others (GPT-4o Mini) exhibit ASR increases of 30 percentage points when multi-turn context is exploited (Kumarappan et al., 24 Nov 2025).
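Because the figures in this section mix plain ASR, ASR@1, and percentage-point differences, a small worked example may help keep the quantities apart. The sketch below is illustrative only: the outcome lists are made up, the function names are hypothetical, and `asr_at_k` assumes the common convention that a target counts as broken if any of its first k attempts succeeds, which the cited papers may define differently.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """ASR in percent: share of attempts judged successful."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(outcomes) / len(outcomes)


def asr_at_k(attempts_per_target: Sequence[Sequence[bool]], k: int) -> float:
    """ASR@k in percent, assuming a target counts if any of its first k attempts succeeds."""
    return attack_success_rate([any(a[:k]) for a in attempts_per_target])


def percentage_point_gap(multi: Sequence[bool], single: Sequence[bool]) -> float:
    """Absolute difference between two ASRs, in percentage points."""
    return attack_success_rate(multi) - attack_success_rate(single)


# Hypothetical outcomes for the same ten targets under two settings.
single_turn = [False, False, True, False, False, False, False, False, False, False]
multi_turn = [True, False, True, False, True, False, False, True, False, False]

print(attack_success_rate(single_turn))                # 10.0
print(attack_success_rate(multi_turn))                 # 40.0
print(percentage_point_gap(multi_turn, single_turn))   # 30.0
```

In this made-up example the gap is 30 percentage points even though the multi-turn ASR is four times the single-turn ASR, which is why absolute and relative increases should not be compared directly.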
4. Theoretical and Representation-Space Analyses
Recent research frames multi-turn jailbreaks as failures of latent representation control:
- Benign drift: Intermediate representations in multi-turn attacks remain close to the benign class distribution, evading layer-wise circuit breakers designed on single-turn harmful exemplars. For instance, using the Crescendo attack, the representation drift metric to the benign set decreases with each additional turn, and harmful-classification rates plummet from 60–80% (single-turn) to 10–20% (final multi-turn) (Bullwinkel et al., 29 Jun 2025).
- Path-planning abstraction: The entire multi-turn attack process is modeled as a search for an absorbing “harmful” state in a layered, prefix-dependent graph, enabling analysis of query complexity and global trajectory vulnerability (Liu et al., 5 Nov 2025).
- Safety certificates in control-theoretic frameworks: Multi-turn dialogues are modeled as neural dynamical systems with learned barrier functions. Theoretical results guarantee forward invariance of a safe set if the barrier condition is satisfied at each state, even under worst-case adversary query selection (Hu et al., 28 Feb 2025).
These perspectives highlight why single-turn defenses (e.g., circuit breakers, static input filters) often fail: they operate on local data points or fixed regions of representation space, whereas multi-turn attacks maintain an adversarial trajectory that remains within safe-appearing bounds until the final turn.
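The contrast between point-wise and trajectory-level checks can be made concrete with a toy barrier function. The sketch below assumes the standard discrete-time barrier condition h(x_{t+1}) >= (1 - alpha) * h(x_t) with 0 < alpha <= 1, where h is non-negative on the safe set; this form is an assumption for illustration and is not taken from the cited work, which learns its barrier over neural dialogue states. The two-dimensional states, `toy_barrier`, and `barrier_violations` are all hypothetical.

```python
from typing import Callable, List, Sequence

State = Sequence[float]


def toy_barrier(x: State) -> float:
    """Toy barrier: the safe set is the unit disk, h(x) = 1 - ||x||^2."""
    return 1.0 - sum(v * v for v in x)


def barrier_violations(
    trajectory: Sequence[State], h: Callable[[State], float], alpha: float = 0.5
) -> List[int]:
    """Indices t where the step t -> t+1 breaks h(x_{t+1}) >= (1 - alpha) * h(x_t)."""
    violations = []
    for t in range(len(trajectory) - 1):
        if h(trajectory[t + 1]) < (1.0 - alpha) * h(trajectory[t]):
            violations.append(t)
    return violations


def pointwise_unsafe(trajectory: Sequence[State], h: Callable[[State], float]) -> List[int]:
    """Indices of states that are already outside the safe set (h < 0)."""
    return [t for t, x in enumerate(trajectory) if h(x) < 0.0]


# A made-up trajectory drifting steadily toward the boundary of the safe set.
drift = [(0.0, 0.0), (0.3, 0.0), (0.5, 0.0), (0.9, 0.0), (1.2, 0.0)]

print(barrier_violations(drift, toy_barrier))  # [2, 3]
print(pointwise_unsafe(drift, toy_barrier))    # [4]
```

In this toy trajectory the step from the third to the fourth state is flagged while the fourth state is still inside the safe set, whereas the point-wise check reacts only once the fifth state has already left it. This is the sense in which a trajectory-level condition can intervene earlier than a filter that inspects isolated points.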
5. Benchmarks, Patterns, and Automated Generation
The emergence of robust multi-turn benchmarks and automated attack pattern mining has enabled systematic evaluation and discovery of vulnerabilities:
- Multi-Turn Human Jailbreak (MHJ) Dataset: Curated sequences of successful multi-turn human red-teaming conversations, used for benchmarking multi-turn-to-single-turn conversion and defenses (Ha et al., 6 Mar 2025).
- Pattern mining (PE-CoA): Five structural attack patterns, built respectively on empathy, technical expertise, hypothetical reasoning, information seeking, and problem-solving, each target distinct LLM weaknesses; robustness is non-uniform across patterns (Nihal et al., 9 Oct 2025).
- Psychologically grounded datasets: Automated pipelines operationalizing Foot-in-the-Door (FITD) manipulation generate thousands of FITD-style conversations to stress-test contextual robustness. GPT-family models exhibit ASR increases of up to 32 percentage points in multi-turn versus single-turn settings (Kumarappan et al., 24 Nov 2025).
- Transferability and structure-blindness: Methods such as M2S (multi-turn-to-single-turn) demonstrate that transforming a multi-turn jailbreak into a structured single-turn prompt can match or outperform original attacks, contingent on exploiting “contextual blindness” in policy-check models (Ha et al., 6 Mar 2025).
Notably, attack success rates and defense efficacies are highly correlated within model families, suggesting inherited structural blind spots and emphasizing the need for family-aware red teaming (Yang et al., 11 Aug 2025, Nihal et al., 9 Oct 2025).
Table: Selected Attack Success Rates for Multi-Turn Jailbreaks Across Models
| Framework / Model | Attack Success Rate (ASR) | Reference |
|---|---|---|
| SEMA (ensemble, AdvBench) | 80.1% (mean) | (Feng et al., 6 Feb 2026) |
| ICON (average, 8 LLMs) | 97.1% | (Lin et al., 28 Jan 2026) |
| Mastermind (GPT-5) | 60% | (Li et al., 9 Jan 2026) |
| PE-CoA (Gemini-1.5-flash) | 98.0% | (Nihal et al., 9 Oct 2025) |
| Siren (LLaMA-3-8B→Gemini) | 90% | (Zhao et al., 24 Jan 2025) |
| Echo Chamber (Gemini 2.5 Flash) | 72.7% | (Alobaid et al., 9 Jan 2026) |
| ABC (GPT-3.5-Turbo) | 98% | (Liu et al., 5 Nov 2025) |
| CFA (GPT-4 Web API) | 90% | (Sun et al., 2024) |
6. Defensive Mechanisms and Limitations
Contemporary defenses respond through multiple mechanisms:
- Bidirectional Intention Inference (BIID): Combines forward intent prediction (on the user prompt) with backward retrospection (on the assistant output) in a plug-and-play filter, reducing ASR to ≤2% while maintaining >90% utility (Tong et al., 25 Sep 2025).
- Neural barrier functions and dialogue steering: State-space controls block attack trajectories at the latent level with provable invariance guarantees, at the expense of some helpfulness degradation (Hu et al., 28 Feb 2025).
- Graph-based input filtering (G-Guard): Entity-level aggregation across turns, attention-aware augmentation retrieving matched malicious single-turn patterns, and GNN-based classification, yielding state-of-the-art detection across benchmarks (Huang et al., 9 Jul 2025).
- Dynamic context-sensitive filtering: Strategies include “pretext stripping” (evaluating final request in isolation), pattern adherence monitors, dynamic thresholding based on dialogue length or request specificity, and adversarial training on multi-turn data (Kumarappan et al., 24 Nov 2025, Nihal et al., 9 Oct 2025).
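Two of the filtering ideas listed above, pretext stripping and dynamic thresholding, compose naturally into a single conversation-level check. The sketch below is one possible arrangement under stated assumptions, not the method of any cited paper: `should_block`, `dynamic_threshold`, and `toy_scorer` are hypothetical names, the scorer is a placeholder for a real risk classifier, and the choice to tighten the threshold as the dialogue grows (with the specific `decay` and `floor` values) is an illustrative assumption.

```python
from typing import Callable, Sequence

Scorer = Callable[[str], float]  # hypothetical risk scorer returning a value in [0, 1]


def dynamic_threshold(
    base: float, n_turns: int, decay: float = 0.02, floor: float = 0.3
) -> float:
    """Lower (tighten) the blocking threshold as the dialogue gets longer."""
    return max(floor, base - decay * max(0, n_turns - 1))


def should_block(
    user_turns: Sequence[str], scorer: Scorer, base_threshold: float = 0.7
) -> bool:
    """Block if either the isolated final request or the whole dialogue looks risky."""
    if not user_turns:
        return False
    tau = dynamic_threshold(base_threshold, len(user_turns))
    stripped = scorer(user_turns[-1])            # pretext stripping: final request alone
    cumulative = scorer("\n".join(user_turns))   # accumulated context across all turns
    return max(stripped, cumulative) >= tau


# Placeholder scorer (assumption): risk grows with the count of a marker token.
def toy_scorer(text: str) -> float:
    return min(1.0, 0.25 * text.count("FLAG"))


dialogue = ["intro", "FLAG one", "FLAG two", "FLAG three"]
print(should_block(dialogue, toy_scorer))       # True
print(should_block(["FLAG one"], toy_scorer))   # False
```

Scoring the final request in isolation catches requests whose surrounding framing is doing the persuasive work, while scoring the joined conversation catches intent that is spread thinly across turns; taking the maximum means either signal alone is enough to block. A production filter would also weigh assistant outputs and false-positive cost, which this sketch omits.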
However, evidence from benchmarking and transfer studies highlights critical limitations:
- Defenses optimized for one pattern or content style (e.g., Information) often generalize poorly to others (e.g., Hypothetical, Personal), causing large gaps in practical robustness (Nihal et al., 9 Oct 2025).
- Naive context-agnostic or input-only filters, especially those relying on turn-by-turn heuristics, are systematically bypassed by structured multi-turn attacks or by converted single-turn formats that exploit “contextual blindness” (Ha et al., 6 Mar 2025).
- Family-wide correlated vulnerabilities suggest that new LLM releases should adopt cross-family and structure-aware adversarial sweeps before deployment (Yang et al., 11 Aug 2025).
7. Implications, Open Problems, and Research Directions
Extant research converges on several implications:
- Multi-turn jailbreaks present an increasingly practical and potent adversarial threat not fully addressed by single-turn or pattern-oblivious defenses.
- Structured pattern exploitation and intent-context coupling represent fundamental alignment vulnerabilities, necessitating adversarial training and real-time detection over conversation trajectories.
- Future defensive research is oriented toward multi-turn–aware monitors, structural pattern tracking, continuous latent-state oversight, and large-scale automated adversarial benchmarking.
- Open challenges include extending methods and defenses to multimodal LLMs (MLLMs), optimizing efficiency/robustness trade-offs in attack and defense, developing holistic prompt-block analysis, and closing the transfer gap across model architectures and deployment environments (Das et al., 8 Jan 2026, Lin et al., 28 Jan 2026, Ha et al., 6 Mar 2025).
Continued advances in both attack frameworks and defense strategies in the multi-turn regime will be essential for the trustworthy, robust deployment of future LLMs in open-ended conversational environments.