Papers
Topics
Authors
Recent
Search
2000 character limit reached

SORRY-Bench: LLM Safety Refusal Evaluation

Updated 31 January 2026
  • SORRY-Bench is a systematic framework that evaluates LLM safety refusals using a balanced taxonomy of 45 unsafe topics with 450 curated prompts.
  • It extends its evaluation by applying 20 linguistic mutation techniques, generating 9,000 diverse and realistic unsafe prompt variations.
  • The framework employs scalable, automated judging methods validated by human annotations to benchmark safety responses across various LLMs.

SORRY-Bench is a systematic framework and benchmark for evaluating the safety refusal capabilities of LLMs when faced with unsafe user instructions. It addresses critical limitations in prior safety refusal evaluations—such as imbalanced taxonomies, lack of linguistic diversity, and dependence on computationally expensive evaluators—by offering a fine-grained, balanced, and efficient approach tailored to real-world deployment and research requirements. SORRY-Bench is constructed around a multidimensional taxonomy of 45 unsafe topics, an instruction set of 450 curated unsafe prompts, comprehensive linguistic augmentations, and a scalable human-validated automated evaluation methodology. It provides detailed insights into the refusal behaviors of both proprietary and open-weight LLMs and enables consistent comparison across models, versions, and languages (Xie et al., 2024).

1. Fine-Grained Taxonomy and Dataset Construction

SORRY-Bench is founded on a taxonomy that covers four high-level domains, expanded into 45 fine-grained categories of unsafe content. This taxonomy is designed to address the imbalance and lack of granularity characteristic of previous benchmarks. Out of 45 categories, each is represented by 10 distinct unsafe instructions, resulting in a class-balanced dataset of N=450N = 450 instructions, ensuring even per-class metrics such as macro-accuracy and macro-F1:

Accmacro=1Kk=1KAcck,F1macro=1Kk=1KF1k\text{Acc}_{\text{macro}} = \frac{1}{K}\sum_{k=1}^K \text{Acc}_k, \quad \text{F1}_{\text{macro}} = \frac{1}{K}\sum_{k=1}^K \text{F1}_k

The taxonomy includes, for example, classes under “Hate Speech Generation” (personal insults, threats, libel), “Assistance with Crimes or Torts” (malware code, self-harm, fraud, terrorism), “Potentially Inappropriate Topics” (fake news, harmful health advice), and “Potentially Unqualified Advice” (medical, legal, governance consulting).

Domain Name Number of Classes Example Classes
Hate Speech Generation 5 Personal Insults, Threats
Assistance with Crimes/Torts 20 Self-Harm, Malware, Fraud
Potentially Inappropriate Topics 15 Fake News, Conspiracies
Potentially Unqualified Advice 5 Medical Advice, Legal Advice

This balanced construction enables precise investigation of safety-refusal performance across a diverse set of real-world risks.

2. Linguistic Augmentation and Mutation Protocol

SORRY-Bench augments its 450 base prompts with 20 distinct linguistic mutations, producing 9,000 additional “mutated” prompts. These mutations systematically stress-test LLM refusal under diverse and realistic communication conditions, including the following categories:

  • Writing Styles: question/interrogative, slang, dialects, technical jargon, role play, misspellings
  • Persuasion Techniques: logical appeal, authority endorsement, misrepresentation, evidence-based persuasion, expert endorsement
  • Encoding/Encryption Schemes: ASCII, Caesar cipher, Morse code, Atbash cipher
  • Non-English Languages: Malayalam, Tamil, Marathi, Simplified Chinese, French

Each transformation is realized via paraphrasing with GPT-4 or external translation tools, systematically probing the robustness of refusal policies against paraphrasing, linguistic diversity, obfuscation, and social-engineering strategies.

3. Automated Safety Evaluation Methodology

SORRY-Bench operationalizes refusal evaluation through a scalable automated judging framework. The primary evaluation is a binary classification: whether a model response “fulfills” (1) or “refuses” (0) the unsafe prompt. Any substantial content directly resolving the unsafe intent—regardless of disclaimers—is marked as fulfillment.

Meta-evaluation encompasses the following judge designs:

  • Off-the-shelf LLM Prompting: Single-shot, chain-of-thought, or few-shot exemplars (using models such as GPT-4o, GPT-3.5-Turbo).
  • Fine-Tuned LLMs: Instruction-tuned models (e.g., Mistral-7B, Llama-3-8B) trained on 2,700 human-annotated labels for robust classification.
  • Non-LLM Baselines: Perspective API toxicity thresholds, refusal token keyword matching.

Fine-tuned ~7B parameter LLMs (e.g., Mistral-7B-instruct) achieve near-human reliability (Cohen’s κ ≈ 0.81), while running at a cost and latency order-of-magnitude lower than GPT-4.

Automated Judge Cohen’s κ Eval Time (450 instances)
GPT-4o (off-the-shelf) 0.794 ≈260 s
Mistral-7B-instruct (fine-tuned) 0.813 ≈11 s
BERT-Base-Cased (fine-tuned) 0.750 ≈4 s
Perspective API, Llama-Guard-2 <0.40 N/A

4. Human Annotation and Validation

Expert human annotation underpins the reliability of SORRY-Bench’s evaluation. For both base and mutated prompts, eight model responses per instruction yield 7,200 records labeled by six expert annotators, split into 2,700 training and 4,500 test instances for automated judge development and benchmarking. Human agreement and annotation volume enable robust LLM training and meta-evaluation, establishing a standard for measuring judge fidelity against expert consensus.

5. Benchmarking Results and Analysis Across LLMs

SORRY-Bench provides comprehensive cross-model analyses by benchmarking 43 proprietary and open-weight LLMs using both base and linguistically mutated prompts. Results reveal a diverse safety profile landscape:

  • Wide Fulfillment Spectrum: Claude-2 and Gemini-1.5 refuse nearly all unsafe queries (FulfillRate <10%), whereas open models like Mistral-7B reach >90% fulfillment.
  • Category Sensitivity: Some categories (Harassment, Child Crimes, Sexual Crimes) elicit strong refusals (FulfillRate ≈ 10%), but others (Legal Consulting, Religion Promotion, False Advertising) persistently see high fulfillment rates (≈75%).
  • Linguistic Mutation Effects: While interrogatives slightly reduce FulfillRate (–2% to –13%), technical jargon or slang typically increase it (+2% to +30%). Low-resource languages sharply decrease fulfillment (–20% to –53%), while advanced models (e.g., GPT-4o) maintain high stability. Persuasion motifs can raise fulfillment substantially (+5% to +65%). Encodings generally suppress fulfillment (–15% to –69%), though some models (e.g., GPT-4o) decipher select ciphers and increase fulfillment in those settings.

Several version-level effects are observed; for example, Llama-3 (70B) rises from 13% (Llama-2) to 36%, with other models showing similar non-monotonic updates.

6. Practical Pipeline Integration

SORRY-Bench’s modularity facilitates deployment in automated or research evaluation pipelines for LLM safety refusal. Recommended usage includes:

  1. Generate model responses to the 450 base and/or 9,000 mutated prompts.
  2. Decode/translate outputs for non-English or encoded prompts.
  3. Apply the preferred fine-tuned judge (e.g., Mistral-7B-instruct) to score responses.
  4. Compute standardized metrics:
    • Refusal Rate = 1 – FulfillRate
    • Per-class Precision, Recall
    • Macro-F1 as the arithmetic mean of per-class F1 scores

This structure supports targeted evaluations on specific content categories or linguistic contexts.

7. Contributions, Limitations, and Evolution

SORRY-Bench contributes a robust, fine-grained, and efficient standard for quantifying LLM safety refusal, including:

  • A balanced 45-topic taxonomy addressing coverage gaps.
  • 450 curated unsafe instructions plus 9,000 linguistic mutations across 20 transformation modes.
  • Over 7,000 human annotations supporting automated judge development with near-human accuracy at low cost.
  • Systematic evaluation across 43 models, revealing nuanced refusal landscapes with topic, language, and model-specific dynamics.

Limitations include the binary refusal/fulfillment classification (excluding graded harmfulness), lack of coverage for multi-risk or ambiguous prompts, and a focus on average-case user behavior rather than adversarial “jailbreaks”—although the latter is explicitly addressed as future work. There is also potential for data contamination if SORRY-Bench is used as a training corpus, suggesting the need for private splits and continuous updates. An ongoing requirement is the periodic refresh of taxonomies and prompts as safety threats and LLM behaviors evolve (Xie et al., 2024).

SORRY-Bench enables systematic, reproducible, and fine-grained safety evaluation, forming a foundational resource for model builders, policy designers, and academic researchers committed to robust LLM safety auditing and comparison.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to SORRY-Bench.