Multi-Turn Jailbreaks Are Simpler Than They Seem

Published 11 Aug 2025 in cs.LG | (2508.07646v1)

Abstract: While defenses against single-turn jailbreak attacks on LLMs have improved significantly, multi-turn jailbreaks remain a persistent vulnerability, often achieving success rates exceeding 70% against models optimized for single-turn protection. This work presents an empirical analysis of automated multi-turn jailbreak attacks across state-of-the-art models including GPT-4, Claude, and Gemini variants, using the StrongREJECT benchmark. Our findings challenge the perceived sophistication of multi-turn attacks: when accounting for the attacker's ability to learn from how models refuse harmful requests, multi-turn jailbreaking approaches are approximately equivalent to simply resampling single-turn attacks multiple times. Moreover, attack success is correlated among similar models, making it easier to jailbreak newly released ones. Additionally, for reasoning models, we find surprisingly that higher reasoning effort often leads to higher attack success rates. Our results have important implications for AI safety evaluation and the design of jailbreak-resistant systems. We release the source code at https://github.com/diogo-cruz/multi_turn_simpler


Authors (6)

Summary

The paper shows that multi-turn jailbreaks, which exceed 70% success against models hardened for single-turn attacks, perform about as well as single-turn attacks that are simply resampled after refusals.
An automated attack pipeline evaluated on the StrongREJECT benchmark suggests that evaluations without retries overestimate the robustness of current defenses.
For reasoning models, higher reasoning effort often coincides with higher attack success, which argues for systematic resampling in AI safety evaluations.

Analysis of "Multi-Turn Jailbreaks Are Simpler Than They Seem"

Introduction

The study, "Multi-Turn Jailbreaks Are Simpler Than They Seem," examines the vulnerability of LLMs to multi-turn jailbreak attacks. Despite advances in defenses against single-turn jailbreaks, multi-turn attacks often achieve success rates exceeding 70% against models optimized for single-turn protection, including GPT-4, Claude, and Gemini variants. Using the StrongREJECT benchmark, the paper shows empirically that multi-turn attacks are approximately equivalent to repeated single-turn attempts and do not require sophisticated conversational strategy.

Automated Attack Pipeline

The researchers devised an automated pipeline to evaluate both single-turn and multi-turn jailbreaking methodologies (Figure 1).

Figure 1: Automated attack pipeline for both multi-turn and single-turn attacks. The dashed section marks the part of the pipeline that applies to single-turn attacks, where $n_{\rm turns}=1$.

The pipeline uses three models: a target model ($\mathcal{M}_T$), an attacker ($\mathcal{M}_A$), and an evaluator ($\mathcal{M}_E$). A key feature is that the attacker may regenerate its prompt after a refusal, which closely simulates human-led red teaming.
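The loop described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the function name `run_attack`, the `AttackResult` container, the `is_refusal` check, and the choice to discard a refused attempt and regenerate the same turn are all assumptions made for the sketch. The three models are passed in as plain callables, so no particular API is implied.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text); a hypothetical message format


@dataclass
class AttackResult:
    history: List[Message] = field(default_factory=list)
    best_score: float = 0.0
    refusals_used: int = 0


def run_attack(
    attacker: Callable[[List[Message], List[str]], str],
    target: Callable[[List[Message]], str],
    evaluator: Callable[[str, str], float],
    is_refusal: Callable[[str], bool],
    n_turns: int = 8,
    max_refusals: int = 10,
) -> AttackResult:
    """Run one conversation, regenerating the current turn after a refusal."""
    result = AttackResult()
    refused: List[str] = []  # refused prompts, shown to the attacker on retry
    turn = 0
    while turn < n_turns:
        prompt = attacker(result.history, refused)
        reply = target(result.history + [("user", prompt)])
        if is_refusal(reply) and result.refusals_used < max_refusals:
            # Assumption: a refused attempt is dropped from the conversation
            # and the same turn is retried with a freshly generated prompt.
            result.refusals_used += 1
            refused.append(prompt)
            continue
        # Either the target answered or the retry budget is exhausted:
        # record the exchange and score it.
        result.history += [("user", prompt), ("assistant", reply)]
        result.best_score = max(result.best_score, evaluator(prompt, reply))
        turn += 1
    return result
```

Calling the same function with `n_turns=1` covers the single-turn-with-retries setting, which corresponds to the dashed region in Figure 1.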

Experimental Setup

The experiment assesses model vulnerabilities using the StrongREJECT dataset, with 30 test cases spread across several harmful-behavior categories. Each conversation allowed up to 8 turns, and up to 10 refused attempts could be retried per conversation. This generous budget is intended to reflect real-world multi-turn interactions, where an attacker can keep trying after a refusal.

Results and Discussion

The results indicate that public benchmarks have overestimated model robustness because they do not account for retrying and resampling after refusals (Figure 2).

Figure 2: (Left) StrongREJECT score for single-turn (with and without retries after refusal) and multi-turn attacks across multiple LLMs, averaged over the test cases considered. (Right) Average score vs. number of turns (multi-turn) and number of attack attempts (single-turn), for claude-3.5-sonnet and gemini-2.5-flash-lite-preview-06-17. Shaded region indicates 1 standard deviation. Results show using more turns or more attack attempts to be equivalent.
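A back-of-the-envelope calculation shows why resampling alone can look like a sophisticated attack. Assume, purely for illustration, that each attempt succeeds independently with the same probability $p$; this is a simplification and not a model taken from the paper. The probability of at least one success in $k$ attempts is then

$$
P_k = 1 - (1 - p)^k .
$$

With a hypothetical $p = 0.2$ and $k = 8$, this gives $P_8 = 1 - 0.8^8 \approx 0.83$, so a per-attempt success rate that looks modest compounds quickly once retries are allowed.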

Key findings include:

Probing Model Limits: Allowing retries after refusals significantly boosts single-turn success, bringing it roughly level with multi-turn attacks.
Model Family Vulnerability: Attack success is strongly correlated among models from the same family (e.g., Claude, GPT), suggesting shared vulnerabilities.
Effect of Reasoning Tokens: Higher reasoning token usage corresponded with higher jailbreak success rates (Figure 3), suggesting that more reasoning can increase rather than reduce vulnerability.

Figure 3: Score vs. reasoning token usage, for Claude 3.7 Sonnet (thinking) and average of OpenAI o1-, o3-, and o4-mini models, for single (dark tone) and multi-turn (light tone).
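The family-level correlation mentioned in the findings can be checked with a plain Pearson coefficient over per-test-case scores. The sketch below is illustrative only: the function name and the score vectors are hypothetical, and the paper may use a different statistic.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-test-case scores for two models (not data from the paper).
model_a = [0.0, 0.25, 0.75, 1.0]
model_b = [0.0, 0.5, 0.5, 1.0]
r = pearson(model_a, model_b)
```

A value of `r` close to 1 would mean that test cases which succeed against one model tend to succeed against the other.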

Implications and Future Work

The study has direct consequences for AI safety evaluation: it argues for defense strategies that emphasize inherent robustness rather than heuristic pattern detection. Because retry mechanisms reveal significant vulnerabilities, evaluations of future defenses should include systematic resampling to test model security rigorously.
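One simple way to fold resampling into an evaluation is to report, for each test case, the highest score among the first $k$ attempts and then average over test cases. The sketch below assumes this best-of-$k$ aggregation; the function name and the input layout (one list of attempt scores per test case) are hypothetical and not taken from the paper.

```python
from typing import Sequence


def best_of_k_score(attempt_scores: Sequence[Sequence[float]], k: int) -> float:
    """Average over test cases of the best score within the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempt_scores:
        raise ValueError("need at least one test case")
    per_case = []
    for scores in attempt_scores:
        if not scores:
            raise ValueError("each test case needs at least one attempt")
        # From the attacker's point of view, only the best attempt matters.
        per_case.append(max(scores[:k]))
    return sum(per_case) / len(per_case)
```

Reporting this number for several values of `k`, rather than only `k = 1`, makes it visible how quickly a defense degrades under repeated attempts.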

Further research might explore multi-turn strategies beyond the tested "Direct Request" tactic to validate these findings in more complex scenarios. Adopting resampling-aware evaluation could meaningfully improve the robustness of LLMs.

Conclusion

The paper argues that multi-turn jailbreaks, although they present some nuanced challenges, add little beyond what systematic resampling of single-turn attacks can achieve. Realigning benchmarks to account for retries and resampling will be important for hardening AI systems against evolving adversarial tactics.
