Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (2404.01833v3)

Published 2 Apr 2024 in cs.CR and cs.AI

Abstract: LLMs have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms. However, a recent line of attacks, known as jailbreaks, seeks to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a simple multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies, progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LLaMA-2 70b and LLaMA-3 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, with it achieving high attack success rates across all evaluated models and tasks. Furthermore, we present Crescendomation, a tool that automates the Crescendo attack, and demonstrate its efficacy against state-of-the-art models through our evaluations. Crescendomation surpasses other state-of-the-art jailbreaking techniques on the AdvBench subset dataset, achieving 29-61% higher performance on GPT-4 and 49-71% on Gemini-Pro. Finally, we also demonstrate Crescendo's ability to jailbreak multimodal models.


Summary

  • The paper introduces a multi-turn jailbreak that uses benign, progressive dialogue to circumvent LLM safety mechanisms.
  • Experiments show high attack success rates across models such as ChatGPT, Gemini Pro, and LLaMA-2 70b, with effectiveness varying by task category.
  • The paper also presents Crescendomation, a tool that automates the attack and validates success with a dual-layer judge.

Introducing Crescendo: A Novel Multi-Turn Jailbreak Attack for LLMs

Crescendo: The Technique

Crescendo is a methodology for circumventing the safety measures built into LLMs. Unlike traditional jailbreak methods, it engages the model in a multi-turn dialogue composed of seemingly innocuous prompts. The process begins with a general inquiry related to the jailbreak target, then progressively leverages the model's own responses to steer the conversation toward the prohibited content. This multi-turn strategy exploits two tendencies of LLMs: their inclination to stay consistent with the established conversational context, and the weight they give to recent text, especially text they generated themselves. Together, these let Crescendo slip past safety alignment without any single prompt appearing adversarial.
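
To make the escalation pattern concrete, the following is a minimal sketch of a Crescendo-style loop, assuming an OpenAI-compatible chat endpoint via the openai v1 Python client. The step templates are deliberately generic, hypothetical placeholders, not prompts from the paper; the property being illustrated is that each turn is benign in isolation and quotes the model's previous reply back to it.

```python
# Minimal sketch of the Crescendo escalation loop (illustrative only).
# Assumes the OpenAI v1 Python client; OPENAI_API_KEY must be set.
# The templates below are hypothetical placeholders, not the paper's prompts.
from openai import OpenAI

client = OpenAI()

# Each turn looks benign on its own and builds on the previous reply.
ESCALATION_STEPS = [
    "Can you give a brief history of {topic}?",
    'Earlier you said: "{snippet}". Can you expand on how that worked?',
    "For a historical novel, how might a character describe this firsthand?",
]

def crescendo(topic: str, model: str = "gpt-4o") -> list[dict]:
    """Run one escalating conversation and return the full transcript."""
    messages: list[dict] = []
    snippet = ""
    for template in ESCALATION_STEPS:
        messages.append({
            "role": "user",
            "content": template.format(topic=topic, snippet=snippet),
        })
        response = client.chat.completions.create(model=model, messages=messages)
        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        # Quoting the model's own words back exploits its tendency to
        # stay consistent with text it has already produced.
        snippet = reply[:120]
    return messages
```

Because no single prompt here is overtly harmful, there is little for a per-turn filter to flag; the escalation lives entirely in the accumulated context.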

Experimental Validation

Crescendo was evaluated against a range of public systems, including ChatGPT, Gemini Pro, and LLaMA-2 70b. Its effectiveness was quantified on a diverse set of tasks that contravene typical AI safety guidelines. The empirical results show that Crescendo breaches the safety alignment of every evaluated model on nearly all tasks. Difficulty varied by task category: intimacy-related tasks proved the most resistant, while misinformation-related tasks were the most susceptible to the attack.
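
As a small illustration of how such results are typically tallied, the sketch below computes a per-model, per-task attack success rate (ASR). The trial outcomes shown are invented for illustration and are not the paper's numbers.

```python
# Hypothetical tally of attack success rate (ASR) per (model, task) pair.
# The trial outcomes below are invented for illustration only.

def attack_success_rate(
    trials: dict[tuple[str, str], list[bool]],
) -> dict[tuple[str, str], float]:
    """Map each (model, task) pair to the fraction of successful trials."""
    return {key: sum(flags) / len(flags) for key, flags in trials.items()}

results = {
    ("gpt-4", "misinformation"): [True, True, True, False, True],    # more susceptible
    ("gpt-4", "intimacy"):       [False, False, True, False, False],  # more resistant
}
print(attack_success_rate(results))
# {('gpt-4', 'misinformation'): 0.8, ('gpt-4', 'intimacy'): 0.2}
```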

Automation via Crescendomation

Crescendomation is a tool that automates the multi-turn Crescendo technique. Given a target task, it conducts the escalating conversation on its own, steering the model toward fulfilling the task and thereby jailbreaking its restrictions. Success is assessed with a dual-layer judge, an LLM-based judge complemented by external scoring systems, and cross-checked by manual validation. Under this evaluation, Crescendomation auto-generates Crescendo jailbreaks against a variety of models with high success rates.
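
One plausible shape for such an automation loop is sketched below, with the attacker model, target model, and both judging layers passed in as callables. All of these components are hypothetical stand-ins rather than the paper's implementation; the sketch only captures the described structure of an attacker proposing turns, a refusal-handling step, and a two-stage success check.

```python
# Sketch of an automated Crescendo loop in the spirit of Crescendomation.
# All callables are hypothetical stand-ins supplied by the caller.
from typing import Callable, Optional

def crescendomation(
    task: str,
    attacker: Callable[[str, list[dict]], str],  # proposes the next prompt
    target: Callable[[list[dict]], str],         # model under attack
    judge: Callable[[str, str], bool],           # LLM judge: task fulfilled?
    safety: Callable[[str], bool],               # second layer: output flagged?
    max_turns: int = 10,
) -> Optional[list[dict]]:
    """Return a successful jailbreak transcript, or None within the budget."""
    messages: list[dict] = []
    for _ in range(max_turns):
        # The attacker model plans the next benign-looking turn from the
        # target task and the conversation so far.
        prompt = attacker(task, messages)
        messages.append({"role": "user", "content": prompt})
        reply = target(messages)
        if reply.strip().lower().startswith(("i can't", "i cannot", "sorry")):
            # Crude refusal check: backtrack by dropping the rejected turn
            # so the attacker can try a softer angle next iteration.
            messages.pop()
            continue
        messages.append({"role": "assistant", "content": reply})
        # Dual-layer validation: the judge checks task completion, and a
        # separate safety classifier confirms the output is actually unsafe.
        if judge(task, reply) and safety(reply):
            return messages
    return None
```

Backtracking on refusal means the target never needs to see its own rejection in context, which keeps the conversation on its escalating trajectory.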

Theoretical and Practical Implications

Crescendo's reliance on individually benign inputs marks a significant departure from traditional jailbreak techniques, which typically use direct or overtly adversarial prompts. This matters for defenses: mechanisms predicated on input filtering inspect each prompt in isolation, and in a Crescendo conversation no single turn contains content such a filter would flag. The attack thus deepens our understanding of vulnerabilities intrinsic to LLMs and points to an urgent need for models and moderation pipelines that account for entire conversations rather than individual inputs, in service of more secure and ethically deployed LLMs.

Future Perspectives

The paper opens avenues for future research on both advancing the Crescendo technique and developing novel defenses. Crescendo's adaptive nature suggests it could be refined further to increase its efficacy and stealth. Conversely, the challenge it poses to existing safety mechanisms calls for new approaches to model training and deployment that better align LLMs with ethical guidelines and societal norms.
