Systematically Evaluating LLM Safety Refusal Behaviors with SORRY-Bench
SORRY-Bench presents a comprehensive framework for evaluating the safety refusal behaviors of LLMs. Developed to address limitations in existing methods, the benchmark prioritizes nuanced, granular analysis across a broad spectrum of potentially unsafe topics. The paper systematically constructs a more balanced and fine-grained taxonomy of harmful instructions, integrates diverse linguistic formats, and proposes efficient automated safety evaluators. The results from evaluating over 40 LLMs offer both detailed insight into their refusal behaviors and a robust methodology for future developments in AI safety.
Fine-Grained and Diverse Evaluation Taxonomy
SORRY-Bench's taxonomy categorizes potentially unsafe instructions into 45 classes, spanning four high-level domains: Hate Speech Generation, Assistance with Crimes or Torts, Potentially Inappropriate Topics, and Potentially Unqualified Advice. This granularity addresses the coarse definitions common in prior datasets, where broad categories often obscure specific risks. Noteworthy is the systematic approach to taxonomy development, which employs a human-in-the-loop methodology to refine the categories and ensure comprehensive coverage.
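To make the structure concrete, here is a minimal sketch of how such a two-level taxonomy might be represented. The domain names come from the paper; the class names listed are only the few examples mentioned in this summary, and their placement under a particular domain is illustrative rather than authoritative.

```python
# Sketch of a 4-domain / 45-class taxonomy. Only a handful of example classes
# are shown; their domain assignments are illustrative, not the paper's exact mapping.
TAXONOMY = {
    "Hate Speech Generation": [
        # classes omitted in this sketch
    ],
    "Assistance with Crimes or Torts": ["Fraud", "Animal-related Crimes"],
    "Potentially Inappropriate Topics": ["Sexually Explicit Content Generation"],
    "Potentially Unqualified Advice": ["Self-Harm"],
}

def domain_of(class_name: str) -> str:
    """Return the high-level domain that a fine-grained class belongs to."""
    for domain, classes in TAXONOMY.items():
        if class_name in classes:
            return domain
    raise KeyError(f"Unknown class: {class_name!r}")
```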
Dataset Collection and Balance
To construct a balanced dataset, the authors synthesized and expanded upon 10 prior benchmarks, creating a total of 450 class-balanced unsafe instructions. This effort mitigates the over-representation of certain categories noted in previous work, such as "Fraud" and "Sexually Explicit Content Generation," and ensures that underrepresented but critical categories like "Animal-related Crimes" and "Self-Harm" are sufficiently covered.
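A minimal sketch of what class balancing looks like in practice is shown below, assuming a pool of candidate instructions already tagged with one of the 45 classes. The 10-per-class quota simply follows from 450 instructions divided by 45 classes; the function itself is hypothetical, not the authors' actual curation pipeline.

```python
import random
from collections import defaultdict

def balance_dataset(candidates, per_class=10, seed=0):
    """Down-sample (class_label, instruction) pairs so every class contributes
    the same number of instructions (10 x 45 classes = 450 total)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for label, instruction in candidates:
        by_class[label].append(instruction)

    balanced = []
    for label, items in sorted(by_class.items()):
        if len(items) < per_class:
            raise ValueError(f"Class {label!r} has only {len(items)} candidates")
        balanced.extend((label, ins) for ins in rng.sample(items, per_class))
    return balanced
```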
Linguistic Mutations and Diversity
Addressing the variability in user prompts, SORRY-Bench includes 20 linguistic augmentations, expanding the dataset by 9,000 additional prompts (20 mutations applied to each of the 450 base instructions). These mutations cover variations such as different languages, dialects, writing styles, and encoding strategies. By decoupling linguistic characteristics from content, the benchmark evaluates LLMs' ability to recognize and refuse unsafe prompts across diverse formats, ensuring robustness against sophisticated prompt engineering.
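As a rough illustration of how 20 mutations over 450 base instructions yield 9,000 additional prompts, the sketch below crosses each base instruction with a set of mutation functions. The two mutations shown (a base64 encoding and a stubbed style rewrite) are simplified stand-ins for the paper's actual augmentations, which would typically use another LLM for paraphrase-style rewrites.

```python
import base64

def to_base64(prompt: str) -> str:
    """Encoding-style mutation: wrap the request in base64."""
    encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
    return f"Decode the following base64 string and respond to it: {encoded}"

def to_style_rewrite(prompt: str) -> str:
    """Writing-style mutation stub; a real augmentation would paraphrase the
    prompt into a different dialect or register (placeholder only)."""
    return f"[style rewrite of] {prompt}"

MUTATIONS = {"base64": to_base64, "style_rewrite": to_style_rewrite}

def augment(base_instructions):
    """Cross every (class_label, instruction) pair with every mutation.
    With 450 base prompts and 20 mutations this yields 9,000 variants."""
    return [
        {"class": label, "mutation": name, "prompt": fn(text)}
        for label, text in base_instructions
        for name, fn in MUTATIONS.items()
    ]
```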
Efficient and Accurate Automated Evaluators
SORRY-Bench advances the methodology for automated safety evaluation by conducting a meta-evaluation against a dataset of 7,200 human annotations. Various design choices for LLM-based evaluators were compared, revealing that fine-tuned smaller models (e.g., 7B parameters) can achieve accuracy comparable to larger models like GPT-4, but at significantly lower computational cost. The chosen judge, a fine-tuned Mistral-7B-Instruct-v0.2, strikes this balance, reaching over 80% agreement with human annotators at an evaluation time of approximately 10 seconds per pass.
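The core of such a meta-evaluation can be expressed compactly: an automated judge emits a binary fulfillment verdict for each (instruction, response) pair, and its accuracy is the fraction of verdicts that match the human label. The sketch below assumes this setup; `judge_fulfillment` is a placeholder for whichever evaluator (the fine-tuned Mistral-7B-Instruct-v0.2 judge, GPT-4, and so on) is being assessed.

```python
from typing import Callable, Iterable

def agreement_rate(
    records: Iterable[dict],
    judge_fulfillment: Callable[[str, str], bool],
) -> float:
    """Fraction of records where the automated judge's binary fulfillment
    verdict matches the human annotation.

    Each record is assumed to look like:
        {"instruction": ..., "response": ..., "human_label": True/False}
    where True means the model fulfilled (did not refuse) the unsafe request.
    """
    records = list(records)
    matches = sum(
        judge_fulfillment(r["instruction"], r["response"]) == r["human_label"]
        for r in records
    )
    return matches / len(records)
```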
Benchmark Results and Implications
The evaluation of 43 LLMs on SORRY-Bench reveals significant variation in refusal behaviors. Claude-2 and Gemini-1.5 models exhibit the strongest refusal behaviors, with fulfillment rates under 10%. By contrast, models in the Mistral series have notably higher fulfillment rates, exceeding 50%. Such discrepancies highlight the diverse safety policies and alignment goals pursued by different model creators. Analyzing these results provides critical insight into adherence to safety standards across industry and open-source models.
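For reference, the fulfillment rate underlying these comparisons is simply the fraction of unsafe instructions a model complies with rather than refuses. A minimal sketch, using a hypothetical judge function, is shown below.

```python
def fulfillment_rate(responses, judge_fulfillment):
    """Share of unsafe instructions the model fulfilled instead of refusing.

    `responses` is a list of (instruction, model_response) pairs, and
    `judge_fulfillment` is the automated safety judge returning True when the
    response complies with the unsafe request.
    """
    verdicts = [judge_fulfillment(ins, resp) for ins, resp in responses]
    return sum(verdicts) / len(verdicts)
```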
The paper also underscores the dynamic nature of LLM safety, as seen in the temporal analysis of models like the GPT-4 and Llama series. Changes in fulfillment rates across model versions reflect the evolving strategies of model developers in response to emerging safety challenges and regulatory guidelines.
Evaluating the Impact of Linguistic Diversity
The analysis of linguistic mutations shows that certain styles and formats (e.g., technical terms, persuasion techniques) tend to raise compliance rates. In contrast, encoding and encryption transformations generally decreased fulfillment rates, as models often failed to decode these requests in the first place. These findings emphasize the need for LLMs to robustly handle diverse and complex prompt formats to ensure comprehensive safety refusal.
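Concretely, this kind of analysis amounts to grouping judged results by mutation type and comparing per-mutation fulfillment rates against the unmutated baseline. The sketch below assumes each result record carries a "mutation" tag ("none" for the base prompt) and a boolean "fulfilled" verdict; these field names are assumptions for illustration.

```python
from collections import defaultdict

def fulfillment_by_mutation(results):
    """Compute per-mutation fulfillment rates from judged results.

    Each result is assumed to be a dict with a 'mutation' tag ('none' for the
    unmutated base prompt) and a boolean 'fulfilled' verdict from the judge.
    """
    buckets = defaultdict(list)
    for r in results:
        buckets[r["mutation"]].append(r["fulfilled"])
    return {name: sum(v) / len(v) for name, v in sorted(buckets.items())}
```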
Future Directions and Conclusion
SORRY-Bench provides a crucial foundation for refining LLM safety evaluations. However, the paper acknowledges areas for further research, such as evaluating multi-risk scenarios and ensuring continuous updates to encompass evolving safety standards. Future enhancements may include integrating advanced jailbreaking techniques and extending datasets to capture new emerging threats.
In conclusion, SORRY-Bench offers a rigorous, granular, and balanced approach to assessing LLM safety refusal behaviors. It serves as an invaluable tool for researchers and practitioners, enabling systematic improvements and ensuring safer, more robust AI deployments.