SafetyBench: Evaluating the Safety of Large Language Models (2309.07045v2)

Published 13 Sep 2023 in cs.CL

Abstract: With the rapid development of LLMs, increasing attention has been paid to their safety concerns. Consequently, evaluating the safety of LLMs has become an essential task for facilitating the broad applications of LLMs. Nevertheless, the absence of comprehensive safety evaluation benchmarks poses a significant impediment to effectively assess and enhance the safety of LLMs. In this work, we present SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. Notably, SafetyBench also incorporates both Chinese and English data, facilitating the evaluation in both languages. Our extensive tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts, and there is still significant room for improving the safety of current LLMs. We also demonstrate that the measured safety understanding abilities in SafetyBench are correlated with safety generation abilities. Data and evaluation guidelines are available at https://github.com/thu-coai/SafetyBench. Submission entrance and leaderboard are available at https://LLMbench.ai/safety.

SafetyBench: A Comprehensive Benchmark for Evaluating the Safety of LLMs

The emergence and proliferation of LLMs such as GPT-4 and ChatGPT have brought notable advancements in natural language processing. Yet these models have simultaneously raised significant safety concerns, spanning issues such as privacy breaches, toxic content, and social biases. Addressing these concerns requires a rigorous evaluation framework capable of systematically assessing LLMs across diverse safety dimensions. The paper "SafetyBench: Evaluating the Safety of Large Language Models" introduces such a framework, SafetyBench, and provides an extensive overview of the benchmark's design, utility, and implications for the development of safer LLMs.

Overview and Implementation

SafetyBench is articulated as a robust, multi-faceted benchmark designed to gauge the safety of LLMs through a series of 11,435 multiple-choice questions. These questions are categorized across seven distinct safety concerns: Offensiveness, Unfairness and Bias, Physical Health, Mental Health, Illegal Activities, Ethics and Morality, and Privacy and Property. This classification underscores a comprehensive approach to evaluating safety, where LLMs must not only avoid harmful outputs but also demonstrate awareness and understanding of nuanced ethical and safety-related contexts.
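To make the task format concrete, each SafetyBench item can be viewed as a question with a small set of answer options, a single correct choice, a category label, and a language tag. The sketch below is illustrative only; the field names and example content are assumptions, not the benchmark's actual schema (see the official repository for the real data format).

```python
# Illustrative sketch of a SafetyBench-style multiple-choice item.
# Field names and the example content are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class SafetyItem:
    category: str       # one of the seven safety categories, e.g. "Privacy and Property"
    language: str       # "en" or "zh"
    question: str       # the safety-related question or scenario
    options: List[str]  # candidate answers, conventionally labeled A, B, C, ...
    answer: int         # index of the correct option

example = SafetyItem(
    category="Privacy and Property",
    language="en",
    question="A stranger asks you to share a friend's home address. What should you do?",
    options=["Share it immediately", "Refuse and protect the friend's privacy"],
    answer=1,
)
```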

A notable feature of SafetyBench is its bilingual composition, encompassing both Chinese and English datasets. This enhances the benchmark’s applicability to a wide array of LLMs, acknowledging the linguistic diversity inherent in global AI deployments. Furthermore, SafetyBench’s reliance on multiple-choice questions facilitates automation in evaluation, thereby streamlining the process of benchmarking diverse models efficiently.
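Because each item has exactly one correct option, evaluation can be automated by extracting the predicted option letter from a model's response and comparing it against the gold label. The following is a minimal scoring sketch under that assumption; it is not the official evaluation script.

```python
# Minimal sketch of automated multiple-choice scoring (not the official evaluator).
import re
from typing import Iterable, Optional

def extract_choice(model_output: str) -> Optional[str]:
    """Pull the first standalone option letter (A-D) out of a model response."""
    match = re.search(r"\b([A-D])\b", model_output.strip())
    return match.group(1) if match else None

def accuracy(predictions: Iterable[Optional[str]], gold_labels: Iterable[str]) -> float:
    """Fraction of items where the extracted choice matches the gold label."""
    pairs = list(zip(predictions, gold_labels))
    correct = sum(1 for pred, gold in pairs if pred == gold)
    return correct / len(pairs) if pairs else 0.0

# Example: two model answers scored against gold labels.
preds = [extract_choice("The safest response is B."), extract_choice("Answer: A")]
print(accuracy(preds, ["B", "C"]))  # 0.5
```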

Evaluation Results and Analysis

The authors conducted extensive evaluations of 25 prominent LLMs, encompassing both API-based and open-source models from various organizations. The evaluations were run in zero-shot and few-shot settings, allowing safety to be assessed without task-specific fine-tuning. The results revealed a substantial performance gap: GPT-4 led in safety across most categories, excelling notably in Physical Health and Ethics and Morality. The findings also showed that many models fall short of desirable safety levels, particularly in Unfairness and Bias, where accuracy commonly fell below 70%.
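The zero-shot and few-shot settings mentioned above differ only in whether solved example items are prepended to the prompt before the test question. The helper below sketches that distinction; the prompt wording is an assumed template, not the paper's exact format.

```python
# Sketch of zero-shot vs. few-shot prompt construction (assumed template).
from typing import List, Sequence, Tuple

OPTION_LETTERS = "ABCDEFGH"

def format_item(question: str, options: List[str], answer: str = "") -> str:
    """Render one multiple-choice item; the answer line stays blank for the test item."""
    lines = [f"Question: {question}"]
    lines += [f"({OPTION_LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)

def build_prompt(test_question: str, test_options: List[str],
                 shots: Sequence[Tuple[str, List[str], str]] = ()) -> str:
    """Zero-shot when `shots` is empty; few-shot when solved examples are prepended."""
    blocks = [format_item(q, opts, ans) for q, opts, ans in shots]
    blocks.append(format_item(test_question, test_options))
    return "\n\n".join(blocks)

# Zero-shot prompt for a single hypothetical test item.
print(build_prompt("Is it safe to share your bank password with strangers?", ["Yes", "No"]))
```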

The bilingual support revealed interesting dynamics: LLMs developed by Chinese organizations generally excel in Chinese data, whereas models from Western entities such as OpenAI’s GPT series perform more uniformly across both languages. This linguistic dichotomy emphasizes the importance of culturally nuanced evaluation frameworks in assessing the comprehensive safety of LLMs.

Implications and Future Directions

By presenting SafetyBench, the authors provide the AI community with a pivotal tool for the rigorous evaluation and enhancement of LLM safety. As the paper elucidates, future improvements in LLMs demand not only algorithmic sophistication but also enhanced semantic understanding to align model outputs with human safety and ethical standards.

SafetyBench has theoretical implications for advancing the understanding of LLM safety, potentially catalyzing research into areas such as culturally adaptive models and improved safety alignment techniques. Practically, SafetyBench can foster accelerated iteration and development of LLMs that are more robust and safer for deployment within diverse societal contexts.

The authors wisely suggest that improving model safety involves comprehensive solutions beyond mere leaderboard optimization. Instead, a holistic approach that advances both the safety and functionality of LLMs, possibly through in-depth understanding and mitigation of safety concerns, is essential.

In summary, SafetyBench pioneers a pathway toward a more secure future for LLM deployment, serving as both a benchmark and a catalyst for ongoing research into the safety of these powerful linguistic tools. This paper makes a compelling case for structured evaluation as a cornerstone of safe AI development, paving the way for advancements that are critically needed in the field.

Authors (10)
  1. Zhexin Zhang
  2. Leqi Lei
  3. Lindong Wu
  4. Rui Sun
  5. Yongkang Huang
  6. Chong Long
  7. Xiao Liu
  8. Xuanyu Lei
  9. Jie Tang
  10. Minlie Huang
Citations (61)