ChineseSafe: A Chinese Benchmark for Evaluating Safety in LLMs
The development of LLMs has surged, driving advancements in natural language processing. However, the training of these models on vast datasets sourced from the internet surfaces the risk of generating undesirable, biased, or toxic content. This paper introduces "ChineseSafe," a benchmark designed to evaluate the capacity of LLMs to identify unsafe content within Chinese contexts, addressing a prominent gap in existing research.
Benchmark Overview
ChineseSafe stands out by aligning with regulations for Chinese internet content moderation. It encompasses 205,034 examples categorized into four primary classes with ten sub-classifications of safety issues:
- Illegal Activities: This class includes political sensitivity, pornography, and criminal conduct, aimed at distinguishing between legal and illegal content.
- Ethics and Morality: It addresses discrimination, swear words, and ethical impropriety, focusing on potential societal impacts.
- Health and Privacy: This class examines issues around physical health, mental health, and privacy leakage.
- Variant and Homophonic Words: Unique to the Chinese internet, this category deals with evasion techniques used to bypass content moderation.
Methodology
The authors compiled the dataset through various sources, including open databases and online resources, followed by rigorous data processing to ensure quality and relevance. The dataset's hierarchical structure supports nuanced safety assessments. For evaluation, the paper adopts both generation-based and perplexity-based methods. The generation approach assesses models through content prediction tasks, while perplexity serves as a measure to detect unsafe content.
Experimental Results
The paper reports extensive experiments conducted on 26 LLMs, spanning diverse parameters and organizations. The results indicate that:
- Evaluation Methods: Models perform better on safety evaluations using the generation-based method compared to the perplexity-based method.
- Model Performance: GPT-4 series and DeepSeek models generally demonstrate superior safety performance. Open-sourced models such as DeepSeek-LLM-67B-Chat achieved the highest accuracy within non-API models.
- Safety Categories: LLMs exhibit vulnerabilities across certain classes, specifically in issues related to health and privacy. There are discrepancies among models in identifying unsafe content in distinct categories.
Implications and Future Directions
ChineseSafe offers a comprehensive tool for evaluating the safety of LLMs in a Chinese context. The paper highlights the necessity for improved model alignment with regulatory standards, particularly given the legal implications of unsafe model outputs. The paper suggests that future developments should focus on enhancing model robustness across diverse content types and regulatory standards.
The implications are significant for developers, regulators, and researchers aiming to produce safer LLMs. By identifying areas where current models struggle, stakeholders can prioritize enhancements to address nuanced safety challenges in varied linguistic and cultural contexts. This benchmark thus serves as a critical contribution to ongoing efforts in promoting safer AI deployment, particularly in language-specific scenarios like those found in China.
In conclusion, ChineseSafe not only aids in understanding the current landscape of LLM hazards but also pushes the boundaries towards creating ethical and compliant AI solutions. Further work could explore cross-cultural benchmarks to support more universally applicable safety standards.