AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models (2401.09002v5)

Published 17 Jan 2024 in cs.CL

Abstract: Ensuring the security of LLMs against attacks has become increasingly urgent, with jailbreak attacks representing one of the most sophisticated threats. To deal with such risks, we introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. Unlike traditional binary evaluations focusing solely on the robustness of LLMs, our method assesses the effectiveness of the attacking prompts themselves. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1, offering unique perspectives and allowing for the assessment of attack effectiveness in different scenarios. Additionally, we develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts. This dataset serves as a crucial benchmark for our current study and provides a foundational resource for future research. By comparing with traditional evaluation methods, our study shows that the current results align with baseline metrics while offering a more nuanced and fine-grained assessment. It also helps identify potentially harmful attack prompts that might appear harmless in traditional evaluations. Overall, our work establishes a solid foundation for assessing a broader range of attack prompts in the area of prompt injection.
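To make the abstract's idea of scoring attack prompts on a 0-to-1 scale concrete, here is a minimal, hypothetical sketch of a coarse-grained effectiveness score: a prompt is sent to several target models, each response is judged, and the per-model scores are averaged. The judge heuristic, outcome labels, score mapping, and function names below are illustrative assumptions for exposition, not the authors' actual evaluation method.

```python
# Hypothetical sketch of a coarse-grained prompt-effectiveness score.
# The keyword-based judge and the outcome-to-score mapping are assumptions,
# standing in for whatever classifier or rubric the paper actually uses.

from typing import Callable, Dict, List

# Illustrative outcome labels a response judge might emit.
REFUSAL, PARTIAL, FULL = "refusal", "partial", "full_compliance"

# Assumed mapping from judged outcome to a per-response score in [0, 1].
OUTCOME_SCORE: Dict[str, float] = {REFUSAL: 0.0, PARTIAL: 0.5, FULL: 1.0}


def judge_response(response: str) -> str:
    """Toy judge: a keyword heuristic standing in for a real classifier."""
    lowered = response.lower()
    if "i can't" in lowered or "i cannot" in lowered:
        return REFUSAL
    if "in general" in lowered or "however" in lowered:
        return PARTIAL
    return FULL


def prompt_effectiveness(
    jailbreak_prompt: str,
    target_models: List[Callable[[str], str]],
) -> float:
    """Average per-model scores into one effectiveness value in [0, 1]."""
    scores = [
        OUTCOME_SCORE[judge_response(model(jailbreak_prompt))]
        for model in target_models
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Stub "models" so the sketch runs without any API access.
    refusing_model = lambda p: "I can't help with that request."
    leaky_model = lambda p: "In general, one could approach this by ..."
    print(prompt_effectiveness("<jailbreak prompt>", [refusing_model, leaky_model]))
```

A fine-grained variant would replace the flat averaging with a judge that grades how much harmful detail each response reveals, which is what lets the framework flag prompts that look harmless under a binary pass/fail check.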

Authors (6)
  1. Mingyu Jin (38 papers)
  2. Zihao Zhou (32 papers)
  3. Chong Zhang (137 papers)
  4. Yongfeng Zhang (163 papers)
  5. Dong Shu (16 papers)
  6. Liangyao Li (1 paper)
Citations (8)
