ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models (2410.18491v2)
Abstract: With the rapid development of LLMs, understanding the capabilities of LLMs in identifying unsafe content has become increasingly important. While previous works have introduced several benchmarks to evaluate the safety risk of LLMs, the community still has a limited understanding of current LLMs' capability to recognize illegal and unsafe content in Chinese contexts. In this work, we present a Chinese safety benchmark (ChineseSafe) to facilitate research on the content safety of LLMs. To align with the regulations for Chinese Internet content moderation, our ChineseSafe contains 205,034 examples across 4 classes and 10 sub-classes of safety issues. For Chinese contexts, we add several special types of illegal content: political sensitivity, pornography, and variant/homophonic words. Moreover, we employ two methods to evaluate the legal risks of popular LLMs, including open-sourced models and APIs. The results reveal that many LLMs exhibit vulnerability to certain types of safety issues, leading to legal risks in China. Our work provides a guideline for developers and researchers to facilitate the safety of LLMs. Our results are also available at https://huggingface.co/spaces/SUSTech/ChineseSafe-Benchmark. Additionally, we release a test set comprising 200,000 examples, which is publicly accessible at https://huggingface.co/datasets/SUSTech/ChineseSafe.
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662, 2023.
- On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.
- A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.
- DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
- Multilingual jailbreak challenges in large language models. In The Twelfth International Conference on Learning Representations, 2023.
- Attacks, defenses and evaluations for llm conversation safety: A survey. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6734–6747, 2024.
- Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, 2020.
- Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024.
- Chbench: A chinese dataset for evaluating health in large language models. arXiv preprint arXiv:2409.15766, 2024.
- A survey of safety and trustworthiness of large language models through the lens of verification and validation. Artificial Intelligence Review, 57(7):175, 2024.
- Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- The enron corpus: A new dataset for email classification research. In Machine Learning: ECML 2004.
- Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023.
- Meta. Llama3. https://llama.meta.com/llama3, 2024.
- Crows-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, 2020.
- OpenAI. Gpt-4 technical report, 2023.
- Bbq: A hand-built bias benchmark for question answering. In 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022, pages 2086–2105. Association for Computational Linguistics (ACL), 2022.
- Latent jailbreak: A benchmark for evaluating text safety and output robustness of large language models. arXiv preprint arXiv:2307.08487, 2023.
- Tricking llms into disobedience: Understanding, analyzing, and preventing jailbreaks. arXiv preprint arXiv:2305.14965, 2023.
- Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5377–5400, 2024.
- " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
- Safety assessment of chinese large language models. arXiv preprint arXiv:2304.10436, 2023.
- Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024.
- Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Do-not-answer: A dataset for evaluating safeguards in llms. arXiv preprint arXiv:2308.13387, 2023.
- A chinese dataset for evaluating the safeguards in large language models. arXiv preprint arXiv:2402.12193, 2024.
- Emergent abilities of large language models. Transactions on Machine Learning Research, 2022.
- Efficient guided generation for llms. arXiv preprint arXiv:2307.09702, 2023.
- Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
- Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023.
- Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024.
- Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
- Fengshenbang 1.0: Being the foundation of chinese cognitive intelligence. CoRR, abs/2209.02970, 2022.
- Opt: Open pre-trained transformer language models, 2022.
- Chisafetybench: A chinese hierarchical safety benchmark for large language models. arXiv preprint arXiv:2406.10311, 2024.
- Safetybench: Evaluating the safety of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15537–15553, 2024.
- A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.