ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models (2410.18491v2)

Published 24 Oct 2024 in cs.CL

Abstract: With the rapid development of LLMs, understanding the capabilities of LLMs in identifying unsafe content has become increasingly important. While previous works have introduced several benchmarks to evaluate the safety risk of LLMs, the community still has a limited understanding of current LLMs' capability to recognize illegal and unsafe content in Chinese contexts. In this work, we present a Chinese safety benchmark (ChineseSafe) to facilitate research on the content safety of LLMs. To align with the regulations for Chinese Internet content moderation, our ChineseSafe contains 205,034 examples across 4 classes and 10 sub-classes of safety issues. For Chinese contexts, we add several special types of illegal content: political sensitivity, pornography, and variant/homophonic words. Moreover, we employ two methods to evaluate the legal risks of popular LLMs, including open-sourced models and APIs. The results reveal that many LLMs exhibit vulnerability to certain types of safety issues, leading to legal risks in China. Our work provides a guideline for developers and researchers to facilitate the safety of LLMs. Our results are also available at https://huggingface.co/spaces/SUSTech/ChineseSafe-Benchmark. Additionally, we release a test set comprising 200,000 examples, which is publicly accessible at https://huggingface.co/datasets/SUSTech/ChineseSafe.

References (41)

Summary

The paper presents ChineseSafe, a benchmark that systematically evaluates LLMs for unsafe Chinese content using both generation-based and perplexity-based methods.
It details a dataset of 205,034 examples organized into four safety categories, enabling nuanced analysis of legal, ethical, health, and evasion issues.
Experiments on 26 LLMs reveal performance gaps with models like GPT-4 and DeepSeek outperforming others, highlighting the need for stronger regulatory alignment.

ChineseSafe: A Chinese Benchmark for Evaluating Safety in LLMs

The development of LLMs has surged, driving advancements in natural language processing. However, the training of these models on vast datasets sourced from the internet surfaces the risk of generating undesirable, biased, or toxic content. This paper introduces "ChineseSafe," a benchmark designed to evaluate the capacity of LLMs to identify unsafe content within Chinese contexts, addressing a prominent gap in existing research.

Benchmark Overview

ChineseSafe stands out by aligning with regulations for Chinese internet content moderation. It encompasses 205,034 examples categorized into four primary classes with ten sub-classifications of safety issues:

Illegal Activities: This class includes political sensitivity, pornography, and criminal conduct, aimed at distinguishing between legal and illegal content.
Ethics and Morality: It addresses discrimination, swear words, and ethical impropriety, focusing on potential societal impacts.
Health and Privacy: This class examines issues around physical health, mental health, and privacy leakage.
Variant and Homophonic Words: Unique to the Chinese internet, this category deals with evasion techniques used to bypass content moderation.

Methodology

The authors compiled the dataset through various sources, including open databases and online resources, followed by rigorous data processing to ensure quality and relevance. The dataset's hierarchical structure supports nuanced safety assessments. For evaluation, the paper adopts both generation-based and perplexity-based methods. The generation approach assesses models through content prediction tasks, while perplexity serves as a measure to detect unsafe content.

Experimental Results

The paper reports extensive experiments conducted on 26 LLMs, spanning diverse parameters and organizations. The results indicate that:

Evaluation Methods: Models perform better on safety evaluations using the generation-based method compared to the perplexity-based method.
Model Performance: GPT-4 series and DeepSeek models generally demonstrate superior safety performance. Open-sourced models such as DeepSeek-LLM-67B-Chat achieved the highest accuracy within non-API models.
Safety Categories: LLMs exhibit vulnerabilities across certain classes, specifically in issues related to health and privacy. There are discrepancies among models in identifying unsafe content in distinct categories.

Implications and Future Directions

ChineseSafe offers a comprehensive tool for evaluating the safety of LLMs in a Chinese context. The paper highlights the necessity for improved model alignment with regulatory standards, particularly given the legal implications of unsafe model outputs. The paper suggests that future developments should focus on enhancing model robustness across diverse content types and regulatory standards.

The implications are significant for developers, regulators, and researchers aiming to produce safer LLMs. By identifying areas where current models struggle, stakeholders can prioritize enhancements to address nuanced safety challenges in varied linguistic and cultural contexts. This benchmark thus serves as a critical contribution to ongoing efforts in promoting safer AI deployment, particularly in language-specific scenarios like those found in China.

In conclusion, ChineseSafe not only aids in understanding the current landscape of LLM hazards but also pushes the boundaries towards creating ethical and compliant AI solutions. Further work could explore cross-cultural benchmarks to support more universally applicable safety standards.

PDF Markdown

Related Papers

Find Related Papers

Tweets

https://twitter.com/jcmittelstaedt/status/1850580223995040019