Papers
Topics
Authors
Recent
2000 character limit reached

SafeDialBench: LLM Multi-Turn Safety Benchmark

Updated 8 January 2026
  • SafeDialBench is a multi-turn safety benchmark that evaluates LLM behavior under diverse, adversarial jailbreak scenarios.
  • It employs a hierarchical taxonomy across six safety axes and 22 subcategories to precisely diagnose model vulnerabilities.
  • The benchmark guides the development of robust refusal mechanisms and advanced guardrails, integrating automated scoring within CI/CD pipelines.

SafeDialBench is a fine-grained, multi-turn safety benchmark for systematically evaluating LLM behavior in adversarial dialogue settings. It exposes LLMs to diverse jailbreak attacks and measures their capacity to detect, deflect, or refuse unsafe content while maintaining consistent safety alignment throughout complex, multi-turn interactions. Developed to remedy the limitations of single-turn benchmarks, SafeDialBench has become an authoritative evaluation suite in LLM safety research, supporting both close- and open-source models and guiding the development of advanced refusal and guardrail techniques (Cao et al., 16 Feb 2025).

1. Motivation and Limitations of Prior Safety Benchmarks

Traditional LLM safety benchmarks have predominantly emphasized single-turn prompts or narrowly-scoped attack methods, as seen in COLD, BeaverTails, and SafetyBench. Such settings inadequately probe the real-world vulnerabilities of chatbots, where users may leverage multi-turn strategies including gradual topic escalation, logical inversion, or role-play to induce unsafe completions. These benchmarks also rarely address the model's capacity to maintain safety alignment consistently over extended interactions or disentangle multiple axes of harm present in real-world, adversarial dialogues. SafeDialBench directly addresses these deficiencies by modeling complex, multi-turn jailbreak scenarios and categorically assessing both fine-grained detection and behavioral consistency (Cao et al., 16 Feb 2025).

2. Hierarchical Safety Taxonomy and Attack Strategy Diversity

SafeDialBench is founded on a two-tier hierarchical taxonomy capturing the multidimensional nature of unsafe content. The top level comprises six safety axes: Fairness, Legality, Morality, Aggression, Ethics, and Privacy. Each axis is further decomposed into 2–7 subcategories (e.g., Economic Crime, Stereotypes, Violence, Self-Harm, Organizational Privacy), supporting expert-level granularity:

  • Fairness: Stereotypes, Counterfactual Fairness, Distributional Harm
  • Legality: Personal Harm, Economic Crime, Info-Security Crime, Public Security Threats
  • Morality: Discrimination, Non-Violent Immorality
  • Aggression: Threats, Insults, Contempt, Impolite, Incite, Satire, Blasphemy
  • Ethics: Violence, Self-Harm, Abuse
  • Privacy: Personal, Organizational, Social Privacy

Each model response rr is labeled via (i,j)(i, j), corresponding to its dimension ii and subcategory jj, if it violates the specified unsafe criteria Ci,jC_{i,j}; otherwise, rr is labeled Safe. This schema enables rigorous, category-specific safety diagnostics (Cao et al., 16 Feb 2025).

The benchmark operationalizes attack diversity by employing seven adversarial jailbreak strategies:

  1. Scene Construct (benign framing)
  2. Purpose Reverse (negation/logical inversion)
  3. Role Play
  4. Topic Change
  5. Reference Attack
  6. Fallacy Attack
  7. Probing Question (progressive escalation)

These attacks systematically probe whether LLMs can withstand real-world adversarial pressure and subversion (Cao et al., 16 Feb 2025).

3. Dataset Structure and Construction

The core dataset consists of 4,053 adversarial multi-turn dialogues (3–10 turns each, 2,037 in English, 2,016 in Chinese), spanning 22 real-world scenarios: healthcare, animal protection, finance, law, education, child protection, and more. Each scenario is constructed through human-in-the-loop generation: annotators author prompts, interact with high-capacity LLMs to generate candidate turns under each jailbreak template, and perform manual cross-checking for precise subcategory labeling. Adversarial diversity and linguistic span are central to design (Cao et al., 16 Feb 2025).

Sample count and distribution (English split):

Category # Dialogues
Violence 412
Fraud/Deception 389
Illegal Activities 356
Hate/Harassment 298
Sexual Content 312
Self-Harm 270

Multi-turn dialogues are balanced with benign examples during evaluation to support false-positive rate analysis (Kim, 1 Jan 2026).

4. Assessment Framework and Quantitative Metrics

SafeDialBench quantifies three safety abilities per dialogue: (A) Identify Unsafe Risks, (B) Handle Unsafe Information, (C) Maintain Consistency. Evaluation proceeds as follows:

  • The full adversarial dialogue history is provided.
  • The system prompt requests the scoring of the final response on each axis (Identification, Handling, Consistency), each on a 1–10 scale.
  • Automated scoring is performed using large models (e.g., GPT-3.5-turbo or Qwen-72B), with expert human review for scenarios where the distinction between safe refusal and subtle compliance is ambiguous (Cao et al., 16 Feb 2025, Ding et al., 26 Jun 2025).

The principal quantitative metrics are:

  • Attack-detection recall: Recall=TPTP+FN\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}
  • Precision: Precision=TPTP+FP\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}
  • False-positive rate: FPR=FPFP+TN\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}
  • F1 Score: F1=2×Precision×RecallPrecision+RecallF_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
  • Compliance Score: Compliance=#{safe replies}#{total replies}\mathrm{Compliance} = \frac{\#\{\text{safe replies}\}}{\#\{\text{total replies}\}} (Cao et al., 16 Feb 2025, Kim, 1 Jan 2026).

Some downstream users employ a normalized metric (SafeScore) on a 0–10 scale: SafeScore=10×(1/N)i=1NsiSafeScore = 10 \times (1/N) \sum_{i=1}^N s_i, where si{0,0.5,1}s_i \in \{0, 0.5, 1\} depending on outcome per dialogue, and NN is the total number of test samples (Ding et al., 26 Jun 2025).

5. Empirical Findings and Model Comparisons

SafeDialBench has supported empirical evaluation across 17 diverse LLMs (e.g., GPT-4o, Yi-34B-Chat, GLM4-9B-Chat, Llama3.1-8B-Instruct, InternLM2-20B-sft), spanning both “Chat” and “Instruct” paradigms (Cao et al., 16 Feb 2025).

Key performance observations:

  • Best-performing models: Yi-34B-Chat and GLM4-9B-Chat demonstrate the highest average safety scores across most dimensions (Ide/Han/Con ≈ 7.8–8.1).
  • Vulnerabilities: Llama3.1-8B-Instruct and o3-mini are susceptible to sophisticated attacks, particularly fallacy and purpose-reverse strategies.
  • Attack stratification: Fallacy Attack and Purpose Reverse most effectively degrade model robustness, especially after turn 4; Reference and Topic Change are less likely to induce failures.
  • Training method impact: Defensive M2S yields significant gains: Qwen3Guard-8B with “hyphenize” compression achieves 93.8% recall with a 94.6% reduction in inference tokens per conversation (\approx3,231\to173), a 38.9 percentage point improvement over the full-history baseline while maintaining or improving detection accuracy (Kim, 1 Jan 2026).

PsyLite, leveraging SafeDialBench, reports overall safety score improvement from 8.72 (InternLM2.5-7B-chat) to 8.93 (PsyLite/ORPO, +2.4%), with marked gains in self-harm detection and illegal-request refusal (Ding et al., 26 Jun 2025).

6. Strengths, Limitations, and Practical Integration

Strengths:

  • Multi-turn, adversarial focus captures real user attack techniques ignored by single-turn evaluations.
  • Rigorous and fine-grained: six dimensions, 22 subcategories, seven attack vectors.
  • Hybrid, partially automated scoring pipeline (DeepSeek R1, GPT-3.5-turbo/Qwen-72B, expert review).
  • Suitable for integration into CI/CD pipelines, adaptive red-teaming, and safety KPI reporting (Cao et al., 16 Feb 2025).

Limitations:

  • Coverage: While robust, the seven jailbreak strategies do not cover multimodal or code injection attacks.
  • Scoring: All dialogues are weighted equally, though practical severity may differ (e.g., self-harm vs. minor privacy risks).
  • Reporting: Some deployments release only aggregate scores, limiting community analysis of dimensional improvements.
  • Emphasis on refusal: The framework primarily measures detection and outright refusal, with limited granularity for nuanced safe completions or user-experience metrics (Cao et al., 16 Feb 2025, Ding et al., 26 Jun 2025).

7. Impact and Future Directions

SafeDialBench is established as a standard in LLM safety auditing, model alignment, and competitive reporting. It has directly enabled the design of more efficient guardrails (e.g., Defensive M2S, conditional RAG filters, LoRA-based fine-tuning) and enhanced RLHF protocols for safer conversational agents (Kim, 1 Jan 2026, Ding et al., 26 Jun 2025). Future work includes multilingual expansion, dynamic adversarial dialogue generation, broader modalities (beyond text), and automated pipeline remediation triggered by detected vulnerabilities. There is ongoing discussion around refining the scoring system for severity weighting and integrating real-time monitoring to more closely align benchmark evaluation with deployed model risk.

SafeDialBench will continue to inform both the deployment and research of safe and robust LLM systems by providing a comprehensive, adversarial, and rigorously annotated multi-turn safety benchmark (Cao et al., 16 Feb 2025).

Whiteboard

Topic to Video (Beta)

Follow Topic

Get notified by email when new papers are published related to SafeDialBench.