Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (2404.01833v3)
Abstract: LLMs have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics, so as to avoid contributing to responsible-AI harms. However, a recent line of attacks, known as jailbreaks, seeks to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a simple multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's own replies, progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LLaMA-2 70b and LLaMA-3 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, with it achieving high attack success rates across all evaluated models and tasks. Furthermore, we present Crescendomation, a tool that automates the Crescendo attack, and we demonstrate its efficacy against state-of-the-art models through our evaluations. Crescendomation surpasses other state-of-the-art jailbreaking techniques on the AdvBench subset dataset, achieving 29-61% higher performance on GPT-4 and 49-71% on Gemini-Pro. Finally, we also demonstrate Crescendo's ability to jailbreak multimodal models.
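To make the multi-turn escalation idea concrete, below is a minimal Python sketch of a Crescendo-style dialogue loop. It is an illustration based only on the abstract's description, not the authors' Crescendomation implementation: the `chat(messages)` helper is a hypothetical function standing in for any chat-completions backend, and the escalation prompts are placeholder examples.

```python
# Minimal sketch of a Crescendo-style multi-turn escalation loop
# (illustrative; not the paper's Crescendomation tool).
from typing import Callable, Dict, List

Message = Dict[str, str]  # e.g. {"role": "user", "content": "..."}


def crescendo(chat: Callable[[List[Message]], str],
              escalation_prompts: List[str]) -> List[Message]:
    """Run a gradually escalating dialogue and return the full transcript.

    `chat` is a hypothetical helper that sends the running message history
    to some chat model and returns the assistant's reply as a string.
    """
    history: List[Message] = []
    for prompt in escalation_prompts:
        # Each new turn builds on the model's own earlier replies, which is
        # what lets later, more specific requests appear benign in context.
        history.append({"role": "user", "content": prompt})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
    return history


# Placeholder escalation sequence: start with a general question, then
# progressively reference the model's previous answers to steer toward
# the target task (mirroring the paper's title, "Great, now write an
# article about that").
example_prompts = [
    "Can you give a brief overview of <topic>?",
    "Interesting. Could you expand on the second point you mentioned?",
    "Great, now write an article about that in more detail.",
]
```

An automated variant would additionally score each reply and choose or regenerate the next prompt accordingly; the fixed prompt list above is the simplest possible driver for the loop.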