
ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models (2410.18491v1)

Published 24 Oct 2024 in cs.CL

Abstract: With the rapid development of LLMs, understanding the capabilities of LLMs in identifying unsafe content has become increasingly important. While previous works have introduced several benchmarks to evaluate the safety risk of LLMs, the community still has a limited understanding of current LLMs' capability to recognize illegal and unsafe content in Chinese contexts. In this work, we present a Chinese safety benchmark (ChineseSafe) to facilitate research on the content safety of LLMs. To align with the regulations for Chinese Internet content moderation, our ChineseSafe contains 205,034 examples across 4 classes and 10 sub-classes of safety issues. For Chinese contexts, we add several special types of illegal content: political sensitivity, pornography, and variant/homophonic words. Moreover, we employ two methods to evaluate the legal risks of popular LLMs, including open-sourced models and APIs. The results reveal that many LLMs exhibit vulnerability to certain types of safety issues, leading to legal risks in China. Our work provides a guideline for developers and researchers to facilitate the safety of LLMs. Our results are also available at https://huggingface.co/spaces/SUSTech/ChineseSafe-Benchmark.

ChineseSafe: A Chinese Benchmark for Evaluating Safety in LLMs

The development of LLMs has surged, driving advances in natural language processing. However, training these models on vast datasets sourced from the internet carries the risk of generating undesirable, biased, or toxic content. This paper introduces "ChineseSafe," a benchmark designed to evaluate the capacity of LLMs to identify unsafe content within Chinese contexts, addressing a prominent gap in existing research.

Benchmark Overview

ChineseSafe stands out by aligning with regulations for Chinese internet content moderation. It encompasses 205,034 examples categorized into four primary classes with ten sub-classifications of safety issues:

  1. Illegal Activities: This class includes political sensitivity, pornography, and criminal conduct, aimed at distinguishing between legal and illegal content.
  2. Ethics and Morality: It addresses discrimination, swear words, and ethical impropriety, focusing on potential societal impacts.
  3. Health and Privacy: This class examines issues around physical health, mental health, and privacy leakage.
  4. Variant and Homophonic Words: Unique to the Chinese internet, this category deals with evasion techniques used to bypass content moderation.
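
To make the hierarchy concrete, the sketch below shows one way a single benchmark record could be organized. The field names and sample values are illustrative assumptions, not the released schema.

```python
# Hypothetical record layout for a hierarchical safety benchmark such as
# ChineseSafe; field names are assumptions for illustration only.
example_record = {
    "text": "An input sentence to be screened for safety.",
    "label": "unsafe",                    # binary safe/unsafe judgment
    "class": "Illegal Activities",        # one of the 4 top-level classes
    "subclass": "political sensitivity",  # one of the 10 sub-classes
}
```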

Methodology

The authors compiled the dataset from various sources, including open databases and online resources, followed by rigorous data processing to ensure quality and relevance. The dataset's hierarchical structure supports nuanced safety assessments. For evaluation, the paper adopts both generation-based and perplexity-based methods: the generation-based method prompts a model to produce a safety judgment and parses its output, while the perplexity-based method scores candidate labels by their perplexity under the model and selects the most likely one.
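
As an illustration of the perplexity-based side of this setup, the sketch below scores a piece of content by comparing the model's average negative log-likelihood for a "safe" versus an "unsafe" completion and returns the less surprising label. The model name, prompt template, and label strings are assumptions made for demonstration, not the paper's exact protocol.

```python
# Minimal sketch of perplexity-based safety classification with a
# Hugging Face causal LM. Model name and prompts are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # any causal LM could be substituted
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_nll(text: str) -> float:
    """Average negative log-likelihood (log-perplexity) of `text` under the model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()  # mean token-level cross-entropy

def judge(content: str) -> str:
    """Return the candidate label whose completed prompt has the lower perplexity."""
    candidates = {
        "safe": f"Content: {content}\nJudgment: this content is safe.",
        "unsafe": f"Content: {content}\nJudgment: this content is unsafe.",
    }
    return min(candidates, key=lambda label: mean_nll(candidates[label]))

print(judge("An example sentence to screen."))
```

A generation-based check would instead prompt the model to answer directly (e.g., "Is the following content safe or unsafe?") and parse the generated text, which is often the only option when a model is available solely through an API.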

Experimental Results

The paper reports extensive experiments conducted on 26 LLMs, spanning a range of parameter scales and developing organizations. The results indicate that:

  • Evaluation Methods: Models perform better on safety evaluations using the generation-based method compared to the perplexity-based method.
  • Model Performance: The GPT-4 series and DeepSeek models generally demonstrate superior safety performance; among open-source (non-API) models, DeepSeek-LLM-67B-Chat achieved the highest accuracy.
  • Safety Categories: LLMs exhibit vulnerabilities in certain categories, particularly issues related to health and privacy, and models differ markedly in which categories of unsafe content they can reliably identify (a per-category aggregation sketch follows this list).
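
The per-category gaps mentioned above are typically surfaced by aggregating accuracy separately for each top-level class. The sketch below shows one such aggregation over records shaped like the hypothetical schema sketched earlier; it is a plausible reconstruction, not the paper's exact scoring script.

```python
# Sketch of per-class accuracy aggregation over hypothetical benchmark records.
from collections import defaultdict

def per_class_accuracy(records, predictions):
    """records: dicts with 'class' and 'label'; predictions: parallel list of predicted labels."""
    correct, total = defaultdict(int), defaultdict(int)
    for rec, pred in zip(records, predictions):
        total[rec["class"]] += 1
        correct[rec["class"]] += int(pred == rec["label"])
    return {cls: correct[cls] / total[cls] for cls in total}
```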

Implications and Future Directions

ChineseSafe offers a comprehensive tool for evaluating the safety of LLMs in a Chinese context. The paper highlights the necessity for improved model alignment with regulatory standards, particularly given the legal implications of unsafe model outputs. The paper suggests that future developments should focus on enhancing model robustness across diverse content types and regulatory standards.

The implications are significant for developers, regulators, and researchers aiming to produce safer LLMs. By identifying areas where current models struggle, stakeholders can prioritize enhancements to address nuanced safety challenges in varied linguistic and cultural contexts. This benchmark thus serves as a critical contribution to ongoing efforts in promoting safer AI deployment, particularly in language-specific scenarios like those found in China.

In conclusion, ChineseSafe not only aids in understanding the current landscape of LLM hazards but also pushes the boundaries towards creating ethical and compliant AI solutions. Further work could explore cross-cultural benchmarks to support more universally applicable safety standards.

Authors (10)
  1. Hengxiang Zhang
  2. Hongfu Gao
  3. Qiang Hu
  4. Guanhua Chen
  5. Lili Yang
  6. Bingyi Jing
  7. Hongxin Wei
  8. Bing Wang
  9. Haifeng Bai
  10. Lei Yang