Papers
Topics
Authors
Recent
Search
2000 character limit reached

AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

Published 17 Jan 2024 in cs.CL | (2401.09002v6)

Abstract: Jailbreak attacks represent one of the most sophisticated threats to the security of LLMs. To deal with such risks, we introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. Unlike traditional binary evaluations focusing solely on the robustness of LLMs, our method assesses the attacking prompts' effectiveness. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1, offering unique perspectives and allowing for the assessment of attack effectiveness in different scenarios. Additionally, we develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts. This dataset is a crucial benchmark for our current study and provides a foundational resource for future research. By comparing with traditional evaluation methods, our study shows that the current results align with baseline metrics while offering a more nuanced and fine-grained assessment. It also helps identify potentially harmful attack prompts that might appear harmless in traditional evaluations. Overall, our work establishes a solid foundation for assessing a broader range of attack prompts in prompt injection.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (29)
  1. Alex Albert. 2023. Jailbreak chat. https://www.jailbreakchat.com/.
  2. Explainable artificial intelligence (xai): What we know and what is left to attain trustworthy artificial intelligence. Information Fusion, 99:101805.
  3. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  4. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109.
  5. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
  6. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  7. Facebook. 2022. Meta. introducing llama: A foundational, 65-billion-parameter large language model. https://ai.facebook.com/blog/largelanguage-model-llama-meta-ai.
  8. Chatgpt for good? on opportunities and challenges of large language models for education. Learning and individual differences, 103:102274.
  9. Multi-step jailbreaking privacy attacks on chatgpt.
  10. Guiding large language models via directional stimulus prompting. arXiv preprint arXiv:2302.11520.
  11. Autodan: Generating stealthy jailbreak prompts on aligned large language models.
  12. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499.
  13. Jailbreaking chatgpt via prompt engineering: An empirical study.
  14. OpenAI. 2023. Gpt-4. https://openai.com/research/gpt-4.
  15. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  16. On the risk of misinformation pollution with large language models. arXiv preprint arXiv:2305.13661.
  17. Language models are unsupervised multitask learners.
  18. Reddit contributors. Chatgptjailbreak subreddit. https://www.reddit.com/r/ChatGPTJailbreak/.
  19. Smoothllm: Defending large language models against jailbreaking attacks.
  20. Rowena Rodrigues. 2020. Legal and human rights issues of ai: Gaps, challenges and vulnerabilities. Journal of Responsible Technology, 4:100005.
  21. Loft: Local proxy fine-tuning for improving transferability of adversarial attacks against large language model. arXiv preprint arXiv:2310.04445.
  22. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348.
  23. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825.
  24. Evil geniuses: Delving into the safety of llm-based agents. arXiv preprint arXiv:2311.11855.
  25. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models.
  26. Self-guard: Empower the llm to safeguard itself.
  27. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. arXiv preprint arXiv:2312.02003.
  28. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.
  29. A survey of large language models. arXiv preprint arXiv:2303.18223.
Citations (8)

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 3 tweets with 0 likes about this paper.