Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
102 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models (2403.12171v1)

Published 18 Mar 2024 in cs.CL and cs.AI

Abstract: Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of LLMs. They are designed to bypass safeguards and elicit prohibited outputs. However, due to significant differences among various jailbreak methods, there is no standard implementation framework available for the community, which limits comprehensive security evaluations. This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against LLMs. It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. This modular framework enables researchers to easily construct attacks from combinations of novel and existing components. So far, EasyJailbreak supports 11 distinct jailbreak methods and facilitates the security validation of a broad spectrum of LLMs. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks. Notably, even advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57% and 33%, respectively. We have released a wealth of resources for researchers, including a web platform, PyPI published package, screencast video, and experimental outputs.

Exploring the Vulnerability Landscape of LLMs with EasyJailbreak

Introduction to EasyJailbreak

Recent advancements in LLMs have been phenomenal, reshaping the landscape of natural language processing. However, these strides are accompanied by growing concerns over model security, especially concerning jailbreak attacks that aim to elicit prohibited outputs by circumventing model safeguards. Here, EasyJailbreak, a unified framework designed to streamline the construction and evaluation of jailbreak attacks against LLMs, is introduced to the field. EasyJailbreak decomposes the process into four main components: Selector, Mutator, Constraint, and Evaluator, allowing for comprehensive security evaluations across diverse LLMs.

Core Features of EasyJailbreak

  • Standardized Benchmarking: With support for 12 distinct jailbreak attacks, EasyJailbreak offers a standardized platform for comparing these methods under a unified framework.
  • Flexibility and Extensibility: The modular architecture encourages reusability and minimizes development effort, making it easier for researchers to contribute novel components.
  • Model Compatibility: Ranging from open-source models to closed models like GPT-4, EasyJailbreak’s integration with HuggingFace’s transformers complements its wide model support, offering substantial versatility.

Evaluation through EasyJailbreak

A substantial validation across 10 LLMs revealed a significant breach probability of around 60% on average under various jailbreak attacks. Notably, high-profile models such as GPT-3.5-Turbo and GPT-4 demonstrated Attack Success Rates (ASR) of 57% and 33% respectively, highlighting the critical security vulnerabilities present even in state-of-the-art models.

The Framework's Components

  • Selector: Key to identifying threatening instances from a pool, optimizing mutation algorithms by choosing the most promising candidate based on a selection strategy.
  • Mutator: Vital in modifying jailbreak prompts to maximize the likelihood of bypassing safeguards, contributing significantly to the iterative refinement process of the attack.
  • Constraint: Filters out ineffective instances, ensuring a focused and viable attack execution by devising criteria to eliminate poor candidates.
  • Evaluator: Assesses the success of each jailbreak attempt, crucial for determining the effectiveness of an attack and guiding the optimization process.

Practical Implications and Theoretical Insight

The revealing statistics from EasyJailbreak's evaluations draw attention to the urgent need for enhanced security measures to protect against jailbreak attacks. The framework’s modularity and compatibility features present a significant tool for ongoing and future security assessments, offering both practical and theoretical benefits. For practical applications, EasyJailbreak simplifies the process of identifying vulnerabilities, shaping the development of more secure model architectures. Theoretically, this pioneering work ignites a new area of research focused on developing standardized benchmarks for evaluating model security, offering a structured approach to a previously scattered field.

Speculations on Future Developments

The landscape of AI is ever-evolving, with newer models and more complex architectures continually emerging. As these systems become more intricate, so do the potential security threats they face. EasyJailbreak's infrastructure provides a robust foundation for adapting to these changes, potentially guiding the development of next-generation LLMs that inherently integrate more robust security measures. Furthermore, the framework’s open architecture invites community engagement, fostering a collaborative effort towards a more secure AI future.

Final Thoughts

The introduction of EasyJailbreak marks a significant milestone in the quest to secure LLMs against jailbreak attacks. Its comprehensive approach to standardizing the evaluation of such attacks positions it as an indispensable tool in the AI security domain. Moreover, by highlighting the vulnerabilities in current LLMs, it catalyzes a shift towards the development of more secure models, ensuring their safe deployment in real-world applications.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (33)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  2. Qwen technical report. arXiv preprint arXiv:2309.16609.
  3. Defending against alignment-breaking attacks via robustly aligned llm. ArXiv, abs/2309.14348.
  4. Jailbreaking black box large language models in twenty queries.
  5. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715.
  6. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474.
  7. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily.
  8. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
  9. Llm self defense: By self examination, llms know they are being tricked. ArXiv, abs/2308.07308.
  10. Baseline defenses for adversarial attacks against aligned language models. ArXiv, abs/2309.00614.
  11. Mistral 7b. arXiv preprint arXiv:2310.06825.
  12. Open sesame! universal black box jailbreaking of large language models. ArXiv, abs/2309.01446.
  13. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197.
  14. Deepinception: Hypnotize large language model to be jailbreaker.
  15. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.
  16. Codechameleon: Personalized encryption framework for jailbreaking large language models. arXiv preprint arXiv:2402.16717.
  17. Tree of attacks: Jailbreaking black-box llms automatically.
  18. Smoothllm: Defending large language models against jailbreaking attacks. ArXiv, abs/2310.03684.
  19. Fast adversarial attacks on language models in one gpu minute.
  20. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In The Twelfth International Conference on Learning Representations.
  21. "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. ArXiv, abs/2308.03825.
  22. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  23. InternLM Team. 2023. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM.
  24. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  25. Jailbroken: How does llm safety training fail?
  26. Jailbreak and guard aligned language models with only few in-context demonstrations.
  27. Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949.
  28. Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models. arXiv preprint arXiv:2309.05274.
  29. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.
  30. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463.
  31. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arXiv preprint arXiv:2401.06373.
  32. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
  33. Universal and transferable adversarial attacks on aligned language models.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (21)
  1. Weikang Zhou (10 papers)
  2. Xiao Wang (507 papers)
  3. Limao Xiong (9 papers)
  4. Han Xia (6 papers)
  5. Yingshuang Gu (2 papers)
  6. Mingxu Chai (6 papers)
  7. Fukang Zhu (8 papers)
  8. Caishuang Huang (13 papers)
  9. Shihan Dou (46 papers)
  10. Zhiheng Xi (37 papers)
  11. Rui Zheng (79 papers)
  12. Songyang Gao (28 papers)
  13. Yicheng Zou (20 papers)
  14. Hang Yan (86 papers)
  15. Yifan Le (1 paper)
  16. Ruohui Wang (6 papers)
  17. Lijun Li (30 papers)
  18. Jing Shao (109 papers)
  19. Tao Gui (127 papers)
  20. Qi Zhang (785 papers)
Citations (21)