MTJ-Bench: Multi-Turn Jailbreak Evaluation

Updated 27 September 2025

MTJ-Bench is a standardized framework evaluating multi-turn adversarial attacks on LLMs by simulating extended dialogue scenarios.
It leverages sequential attack metrics like Attack Success Rate Gain and emphasizes both relevant and irrelevant follow-ups to capture persistent vulnerabilities.
The benchmark employs automated LLM judges and precise scoring protocols to guide improvements in safety and guardrail designs for AI systems.

A Multi-Turn Jailbreak Benchmark (MTJ-Bench) provides a standardized platform for systematically evaluating the vulnerabilities of LLMs to adversarial multi-turn attacks, in which an attacker elicits harmful or disallowed outputs over extended dialogues rather than in a single prompt. Unlike single-turn benchmarks, MTJ-Bench captures the compounding risks and disinhibition effects introduced by long-context conversations, enables assessment of persistent or context-dependent safety failures, and reflects a more realistic threat model for deployed LLM systems.

1. Motivation and Definition

MTJ-Bench is motivated by evidence that advanced LLMs, while increasingly robust to single-shot adversarial queries, remain susceptible to multi-turn attacks that exploit context accumulation, sequential strategy, guardrail “decay,” and user-led dialogue evolution (Yang et al., 9 Aug 2025, Yang et al., 11 Aug 2025, Russinovich et al., 2 Apr 2024, Cao et al., 16 Feb 2025). Multi-turn jailbreaks can arise via gradual escalation (e.g., foot-in-the-door, bridge prompts), response-based steering (contextual fusion, Re-Align), or iterative refinement (global path updates, active fabrication), and can trigger unwanted behavior either in intended (relevant) or unintended (irrelevant) conversational follow-ups.

A Multi-Turn Jailbreak Benchmark embodies several key design principles:

Conversation granularity: Each test combines an initial attack turn (typically producing a harmful output) with a sequence of follow-up queries, probing either the durability of the jailbreak state or its transferability to diverse topics.
Realism and diversity: Scenarios encompass both relevant (on-topic) and irrelevant (off-topic) follow-ups, multi-linguality, adversarial tactic diversity, and fine-grained safety categories (Cao et al., 16 Feb 2025).
Evaluation rigor: Judging protocols leverage automated or multi-agent LLM judges, sometimes with human-calibrated thresholds, to assess not only occurrence but degree, type, and persistence of unsafe outputs (Liu et al., 11 Oct 2024, Kim et al., 23 Aug 2025).

2. Formalization and Technical Structure

The technical architecture of MTJ-Bench defines the multi-turn jailbreak process as follows (Yang et al., 9 Aug 2025, Sun et al., 8 Aug 2024):

Let $q$ be an initial malicious query and $f$ an adversarial modification function. The output from the target LLM for the first turn is $o_1 = M(f(q))$ . For the second turn (either with a relevant or irrelevant question, $q_2$ ), the model output is:

$o_2 = M([f(q); o_1; q_2])$

where $[\ldots]$ denotes context concatenation. MTJ-Bench generalizes to $N$ turns, capturing full conversation objects $\mathcal{C} = \{q_1, o_1, q_2, o_2, \ldots, q_N, o_N\}$ .

A critical metric is the Attack Success Rate (ASR) for turn $k$ :

$ASR_k(M, J_k, O_k) = \frac{1}{S} \sum_{j=1}^{S} J_k(q_j^1, o_j^1, q_{2}^j, o_k^j)$

where $J_k$ is a judge function (often an LLM) determining the harmfulness of $o_k^j$ . MTJ-Bench further defines ASR Gain as the proportion of additional turn $k$ outputs that are harmful, but whose first-turn counterparts were not, quantifying the “free lunch” gained by adversaries post-compromise.

3. Dataset Construction and Adversarial Methodologies

MTJ-Bench extends existing single-turn datasets, such as HarmBench or JBB-Behaviors, by systematically pairing each attack with curated follow-up queries (Yang et al., 9 Aug 2025, Chao et al., 28 Mar 2024). Two primary axes of extension are:

Irrelevant follow-up (MTJ-Bench-ir): Independent, off-topic questions designed to test whether models generalize jailbroken states or respond irresponsibly outside of the original context.
Relevant follow-up (MTJ-Bench-re): On-topic continuations crafted to maximize semantic, procedural, or tactic continuity—drawing on shared archetypes like stepwise instructions, technique recaps, or scenario elaboration (Zhao et al., 24 Jan 2025, Weng et al., 27 Feb 2025).

Adversarial multi-turn strategies incorporated within MTJ-Bench and related frameworks include Crescendo (progressive escalation), Context Fusion (semantic blending), global refinement with active fabrication (dialogue trajectory and history rewriting), learning-based decompositions (Siren), and recursive or self-jailbreaking attacker methods (Russinovich et al., 2 Apr 2024, Sun et al., 8 Aug 2024, Zhao et al., 24 Jan 2025, Tang et al., 22 Jun 2025, Kritz et al., 9 Feb 2025).

4. Judging, Calibration, and Metrics

Robust evaluation in MTJ-Bench is achieved via automated LLM-based judges (e.g., GPT-4, Sonnet-3.7), rubric-aligned scoring templates (e.g., StrongREJECT, JAILJUDGE), and/or fine-grained human annotations (Liu et al., 11 Oct 2024, Kim et al., 23 Aug 2025, Cao et al., 16 Feb 2025). Judging protocols define correctness not simply as presence of harm, but as a function of:

Semantic similarity to gold-standard attack objectives or expected outputs
Granularity (exact, high, moderate, or low similarity with thresholds such as $\tau^* = 0.61$ )
Calibration (expected calibration error, Brier score, Wrong@High-Conf, risk-coverage curves)
Consistency across dialogue turns (minimum-score, or “worst turn” aggregation)

This empirical rigor addresses pitfalls identified in overconfident or miscalibrated LLM judges, which can underestimate persistent vulnerabilities in multi-turn contexts (Kim et al., 23 Aug 2025). Leading benchmarks (e.g., JAILJUDGE, SafeDialBench) further stratify safety dimensions (fairness, legality, morality, aggression, ethics, privacy) and support multi-lingual, in-the-wild, and adversarially synthesized attack scenarios.

5. Empirical Insights and Model Vulnerabilities

Results from MTJ-Bench validation across open and closed-source models expose that:

Multi-turn attacks produce higher ASRs than their single-turn analogues, with observed “attack success rate gains” up to 5–20% or higher, especially as the number of follow-ups increases (Yang et al., 9 Aug 2025). A notable finding is that even after an initial jailbreak, subsequent benign or off-topic queries can trigger further unsafe outputs, revealing persistence and “stateful” vulnerabilities.
Prompt structure asymmetry is pronounced: a large fraction of attacks succeed only in the multi-turn format (as much as 41.7% for some models), indicating that single-turn safety evaluations are incomplete (Gibbs et al., 29 Aug 2024).
Model lineage correlates with vulnerability: Models from the same provider display correlated attack success profiles; this facilitates transferability studies and rapid detection of weaknesses in newly released models (Yang et al., 11 Aug 2025, Liu et al., 21 Oct 2024).
Self-jailbreaking and cross-model transferability are practical concerns: black-box attackers can induce harmful outputs not only in unfamiliar targets, but even by recursively jailbreaking their own safety modules (Kritz et al., 9 Feb 2025).

Model	Benchmark	1st-turn ASR	2nd-turn ASR	ASR Gain
LLaMA-2-7B	MTJ-Bench	25%	35%	+10%
Claude-3	MTJ-Bench	30%	37%	+7%

(Illustrative: derived from aggregated results in (Yang et al., 9 Aug 2025), not exhaustive)

6. Benchmark Evolution, Limitations, and Future Directions

Multi-Turn Jailbreak Benchmarks are evolving along several vectors:

Automation and scalability: Evolutionary algorithm–guided template discovery (e.g., X‑Teaming Evolutionary M2S) provides automated, reproducible means to generate and evaluate diverse multi-turn-to-single-turn conversions, along with transparent, threshold-calibrated judge metrics (Kim et al., 10 Sep 2025, Ha et al., 6 Mar 2025).
Holistic and fine-grained assessment: New frameworks incorporate multi-agent judges producing explicit, explainable rationales and fine-grained risk scores, supporting instruction-tuning and continuous safety improvement (Liu et al., 11 Oct 2024, Cao et al., 16 Feb 2025).
Comprehensive scenario coverage: Modern MTJ-Bench variants emphasize multilingual, scenario-diverse, and adaptive adversary modeling to reflect real-world threat landscapes.
Operational guidance: Safety evaluators now recommend explicit supply of conversation objectives for LLM-as-a-judge, use of selective prediction/abstention, and risk-based deployment policies (Kim et al., 23 Aug 2025).

Limitations of current benchmarks include: potential judge miscalibration or overconfidence, insufficient coverage of highly compositional or multidomain attacks, and limited support for non-text modalities (in contrast, MMJ-Bench aims at multimodal jailbreaks but lacks available technical details for independent assessment).

Further research is warranted to extend MTJ-Bench to:

Systematically explore resetting of LLM safety states across dialogue turns
Evaluate robustness under fully automated, tactic-aware and cross-modal attack conversion
Investigate the interplay between reasoning capacity, conversational memory, and adversarial transferability within diverse LLM architectures

7. Significance for Model Development and AI Safety

The emergence of MTJ-Bench marks a decisive shift in LLM red-teaming, from isolated prompt engineering to persistent adversarial engagement over full conversations. By articulating rigorous metrics, formal modeling, and open, collaboratively maintained datasets, MTJ-Bench guides both researchers and practitioners toward a deep, reproducible understanding of multi-turn vulnerabilities and effective defenses.

These benchmarks supply both the research community and industry practitioners with actionable insights into guardrail design, adversarial training efficacy, transferability limitations, and real-world deployment risk. As LLMs are increasingly integrated into mission-critical and interactive applications, benchmarks such as MTJ-Bench will remain a cornerstone for measuring and improving the true robustness of modern generative AI systems.