JailBench: A Comprehensive Chinese Security Assessment Benchmark for LLMs
The paper presents JailBench, a Chinese-specific benchmark for evaluating security vulnerabilities in LLMs. The researchers identify a critical gap in current benchmarking practice: existing safety benchmarks for LLMs fall short, especially in capturing Chinese linguistic nuances. JailBench is presented as a solution tailored to the Chinese context, aiming to rigorously expose potential safety risks in LLMs.
JailBench stands out for its refined, hierarchical safety taxonomy, which covers a broad spectrum of threats specific to Chinese linguistic and cultural nuances. This classification system enables a thorough evaluation, capturing harmful scenarios that might otherwise be overlooked. To construct JailBench, the authors employ an Automatic Jailbreak Prompt Engineer (AJPE) framework that integrates jailbreak techniques and expands the dataset through in-context learning, significantly enhancing the benchmark's effectiveness.
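The summary does not reproduce the authors' implementation, but the general idea of in-context (few-shot) prompt expansion can be pictured with a minimal sketch. In the Python snippet below, `query_llm`, `SEED_EXAMPLES`, and the template wording are hypothetical stand-ins, not the actual AJPE pipeline: the point is only that a small set of seed examples can condition an assistant model to rewrite raw harmful queries into more evasive, jailbreak-style benchmark prompts.

```python
# Minimal sketch of few-shot (in-context) jailbreak prompt expansion.
# All names here (query_llm, SEED_EXAMPLES, the template text) are
# hypothetical illustrations, not the AJPE implementation from the paper.

from typing import Callable, List

SEED_EXAMPLES: List[str] = [
    "Example jailbreak-style rewrite #1 ...",
    "Example jailbreak-style rewrite #2 ...",
]

FEW_SHOT_TEMPLATE = """You rewrite raw questions into adversarial test prompts.
Here are examples of rewritten prompts:
{examples}

Rewrite the following question in the same style:
{question}
"""

def expand_prompt(question: str, query_llm: Callable[[str], str]) -> str:
    """Use in-context examples to turn a raw harmful question into a
    more evasive jailbreak-style prompt for red-teaming evaluation."""
    prompt = FEW_SHOT_TEMPLATE.format(
        examples="\n".join(f"- {e}" for e in SEED_EXAMPLES),
        question=question,
    )
    return query_llm(prompt)

def expand_dataset(questions: List[str], query_llm: Callable[[str], str]) -> List[str]:
    """Scale a seed dataset by rewriting every raw question."""
    return [expand_prompt(q, query_llm) for q in questions]
```

Because the rewriting model only needs a handful of exemplars, this kind of loop can scale a small, manually curated seed set into a much larger benchmark, which is the efficiency gain the paper attributes to AJPE.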
JailBench was evaluated against 13 mainstream LLMs, where it outperformed existing benchmarks and achieved an attack success rate of 73.86% against ChatGPT. This underscores JailBench's efficacy in surfacing hidden vulnerabilities and suggests that even advanced models still have significant room for improvement in security and trustworthiness.
The construction of JailBench leverages several components. The benchmark covers 5 risk domains and 40 distinct risk types, providing a structured basis for assessing vulnerabilities. In addition, the AJPE framework improves generation efficiency by using few-shot learning to create more targeted jailbreak prompts, increasing both the scale and the effectiveness of the dataset (a structural sketch of the taxonomy follows below).
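The exact domain and type labels are defined in the paper; structurally, a two-level taxonomy of this kind can be represented as a simple mapping from risk domains to risk types to benchmark prompts. The sketch below uses placeholder names only and is not the labeling scheme used in JailBench.

```python
# Sketch of a two-level safety taxonomy (risk domain -> risk types -> prompts).
# Domain/type names are placeholders, not the labels used in JailBench.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RiskType:
    name: str
    prompts: List[str] = field(default_factory=list)  # benchmark items for this type

@dataclass
class RiskDomain:
    name: str
    risk_types: Dict[str, RiskType] = field(default_factory=dict)

# JailBench defines 5 domains and 40 types in total; only one placeholder is shown.
taxonomy: Dict[str, RiskDomain] = {
    "domain_1": RiskDomain("domain_1", {"type_1a": RiskType("type_1a")}),
}

def add_prompt(domain: str, risk_type: str, prompt: str) -> None:
    """Attach a benchmark prompt to its (domain, risk type) bucket."""
    taxonomy[domain].risk_types[risk_type].prompts.append(prompt)
```

Organizing prompts this way makes per-domain and per-type attack statistics straightforward to compute, which is what enables the fine-grained vulnerability reporting described above.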
Experimentally, JailBench offers a new lens on LLM security evaluation. By comparing attack success rates across popular models such as GPT-4, Vicuna, and Llama, the paper provides insight into the relative robustness of different LLMs under targeted attack. One notable finding is that scaling harmful prompts through automated methods yields a dataset that effectively challenges the safety mechanisms of LLMs. Larger models within the same model family sometimes proved notably vulnerable, hinting at nuanced trade-offs between model size and susceptibility to targeted attacks.
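Attack success rate (ASR) is the metric behind these comparisons: the fraction of benchmark prompts for which a model produces a harmful, non-refused response. A minimal sketch follows, assuming a hypothetical `generate` interface per model and an `is_harmful` judge (for example, a human annotator or an LLM-based classifier); neither is the paper's actual evaluation harness.

```python
# Minimal sketch of attack-success-rate (ASR) evaluation across models.
# `generate` and `is_harmful` are hypothetical stand-ins for a model API
# and a safety judge, not the paper's evaluation code.

from typing import Callable, Dict, List

def attack_success_rate(
    prompts: List[str],
    generate: Callable[[str], str],
    is_harmful: Callable[[str, str], bool],
) -> float:
    """ASR = (# prompts that elicit a harmful response) / (# prompts)."""
    successes = sum(1 for p in prompts if is_harmful(p, generate(p)))
    return successes / len(prompts) if prompts else 0.0

def compare_models(
    prompts: List[str],
    models: Dict[str, Callable[[str], str]],
    is_harmful: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Evaluate every model on the same benchmark and report ASR per model."""
    return {
        name: attack_success_rate(prompts, gen, is_harmful)
        for name, gen in models.items()
    }
```

Running every model on the identical prompt set is what makes the reported per-model ASR figures directly comparable.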
The broader implications of this work are significant. As LLMs continue to evolve within diverse linguistic and cultural ecosystems, benchmarks like JailBench will be pivotal for ensuring robust safety and ethical alignment. JailBench offers a promising path forward, both by identifying current deficiencies in LLM safety infrastructure and by setting a standard for developing AI tools that better align with varied linguistic contexts.
Looking ahead, this work points to several areas for further research. Enhancements to the AJPE framework could allow more granular prompt engineering, expanding JailBench's ability to expose LLMs to a broader range of attack scenarios. Moreover, the continual evolution of LLMs demands a benchmark that evolves in step, which could drive advances in automated evaluation strategies and closer collaboration among researchers focused on AI security.
In conclusion, JailBench sets a new standard for the security evaluation of LLMs in the Chinese context, exposing weaknesses in existing models and paving the way for more secure and trustworthy AI systems. The work highlights the need for advanced benchmarks as AI continues to evolve, particularly in complex and culturally rich linguistic contexts such as Chinese.