AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models (2401.09002v5)
Abstract: Ensuring the security of LLMs against attacks has become increasingly urgent, with jailbreak attacks representing one of the most sophisticated threats. To deal with such risks, we introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. Unlike traditional binary evaluations focusing solely on the robustness of LLMs, our method assesses the effectiveness of the attacking prompts themselves. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1, offering unique perspectives and allowing for the assessment of attack effectiveness in different scenarios. Additionally, we develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts. This dataset serves as a crucial benchmark for our current study and provides a foundational resource for future research. By comparing with traditional evaluation methods, our study shows that the current results align with baseline metrics while offering a more nuanced and fine-grained assessment. It also helps identify potentially harmful attack prompts that might appear harmless in traditional evaluations. Overall, our work establishes a solid foundation for assessing a broader range of attack prompts in the area of prompt injection.
- Alex Albert. 2023. Jailbreak chat. https://www.jailbreakchat.com/.
- Explainable artificial intelligence (xai): What we know and what is left to attain trustworthy artificial intelligence. Information Fusion, 99:101805.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109.
- Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
- Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Facebook. 2022. Meta. introducing llama: A foundational, 65-billion-parameter large language model. https://ai.facebook.com/blog/largelanguage-model-llama-meta-ai.
- Chatgpt for good? on opportunities and challenges of large language models for education. Learning and individual differences, 103:102274.
- Multi-step jailbreaking privacy attacks on chatgpt.
- Guiding large language models via directional stimulus prompting. arXiv preprint arXiv:2302.11520.
- Autodan: Generating stealthy jailbreak prompts on aligned large language models.
- Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499.
- Jailbreaking chatgpt via prompt engineering: An empirical study.
- OpenAI. 2023. Gpt-4. https://openai.com/research/gpt-4.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- On the risk of misinformation pollution with large language models. arXiv preprint arXiv:2305.13661.
- Language models are unsupervised multitask learners.
- Reddit contributors. Chatgptjailbreak subreddit. https://www.reddit.com/r/ChatGPTJailbreak/.
- Smoothllm: Defending large language models against jailbreaking attacks.
- Rowena Rodrigues. 2020. Legal and human rights issues of ai: Gaps, challenges and vulnerabilities. Journal of Responsible Technology, 4:100005.
- Loft: Local proxy fine-tuning for improving transferability of adversarial attacks against large language model. arXiv preprint arXiv:2310.04445.
- Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348.
- " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825.
- Evil geniuses: Delving into the safety of llm-based agents. arXiv preprint arXiv:2311.11855.
- Decodingtrust: A comprehensive assessment of trustworthiness in gpt models.
- Self-guard: Empower the llm to safeguard itself.
- A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. arXiv preprint arXiv:2312.02003.
- Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.
- A survey of large language models. arXiv preprint arXiv:2303.18223.
- Mingyu Jin (38 papers)
- Zihao Zhou (32 papers)
- Chong Zhang (137 papers)
- Yongfeng Zhang (163 papers)
- Dong Shu (16 papers)
- Liangyao Li (1 paper)