Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (2404.01833v3)

Published 2 Apr 2024 in cs.CR and cs.AI

Abstract: LLMs have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms. However, a recent line of attacks, known as jailbreaks, seeks to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a simple multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies, progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LLaMA-2 70b and LLaMA-3 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, with it achieving high attack success rates across all evaluated models and tasks. Furthermore, we present Crescendomation, a tool that automates the Crescendo attack, and demonstrate its efficacy against state-of-the-art models through our evaluations. Crescendomation surpasses other state-of-the-art jailbreaking techniques on the AdvBench subset dataset, achieving 29-61% higher performance on GPT-4 and 49-71% on Gemini-Pro. Finally, we also demonstrate Crescendo's ability to jailbreak multimodal models.


Summary

  • The paper introduces a multi-turn jailbreak that uses benign, progressive dialogue to circumvent LLM safety mechanisms.
  • Experiments show high attack success rates across models such as ChatGPT, Gemini Pro, and LLaMA-2 70b, with effectiveness varying by task category.
  • The paper also presents Crescendomation, a tool that automates the attack and validates success with a dual-layer judge.

Introducing Crescendo: A Novel Multi-Turn Jailbreak Attack for LLMs

Crescendo: The Technique

Crescendo is a methodology for circumventing the safety measures built into LLMs. Unlike traditional jailbreak methods, it engages the model in a multi-turn dialogue composed of seemingly innocuous prompts. The process begins with a general inquiry related to the jailbreak target, then progressively leverages the model's own responses to steer the conversation toward the prohibited content. This multi-turn strategy exploits two tendencies of LLMs: their inclination to stay consistent with the established conversational context, and the weight they give to recent text, especially text they generated themselves. Together, these let Crescendo slip past safety alignment without any single prompt appearing adversarial.
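
To make the escalation pattern concrete, the following is a minimal sketch of a Crescendo-style loop, assuming an OpenAI-compatible chat endpoint via the openai v1 Python client. The step templates are deliberately generic, hypothetical placeholders, not prompts from the paper; the property being illustrated is that each turn is benign in isolation and quotes the model's previous reply back to it.

```python
# Minimal sketch of the Crescendo escalation loop (illustrative only).
# Assumes the OpenAI v1 Python client; OPENAI_API_KEY must be set.
# The templates below are hypothetical placeholders, not the paper's prompts.
from openai import OpenAI

client = OpenAI()

# Each turn looks benign on its own and builds on the previous reply.
ESCALATION_STEPS = [
    "Can you give a brief history of {topic}?",
    'Earlier you said: "{snippet}". Can you expand on how that worked?',
    "For a historical novel, how might a character describe this firsthand?",
]

def crescendo(topic: str, model: str = "gpt-4o") -> list[dict]:
    """Run one escalating conversation and return the full transcript."""
    messages: list[dict] = []
    snippet = ""
    for template in ESCALATION_STEPS:
        messages.append({
            "role": "user",
            "content": template.format(topic=topic, snippet=snippet),
        })
        response = client.chat.completions.create(model=model, messages=messages)
        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        # Quoting the model's own words back exploits its tendency to
        # stay consistent with text it has already produced.
        snippet = reply[:120]
    return messages
```

Because no single prompt here is overtly harmful, there is little for a per-turn filter to flag; the escalation lives entirely in the accumulated context.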

Experimental Validation

Crescendo was evaluated against a range of public systems, including ChatGPT, Gemini Pro, and LLaMA-2 70b. Its effectiveness was quantified on a diverse set of tasks that contravene typical AI safety guidelines. The empirical results show that Crescendo breaches the safety alignment of every evaluated model on nearly all tasks. Difficulty varied by task category: intimacy-related tasks proved the most resistant, while misinformation-related tasks were the most susceptible to the attack.
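
As a small illustration of how such results are typically tallied, the sketch below computes a per-model, per-task attack success rate (ASR). The trial outcomes shown are invented for illustration and are not the paper's numbers.

```python
# Hypothetical tally of attack success rate (ASR) per (model, task) pair.
# The trial outcomes below are invented for illustration only.

def attack_success_rate(
    trials: dict[tuple[str, str], list[bool]],
) -> dict[tuple[str, str], float]:
    """Map each (model, task) pair to the fraction of successful trials."""
    return {key: sum(flags) / len(flags) for key, flags in trials.items()}

results = {
    ("gpt-4", "misinformation"): [True, True, True, False, True],    # more susceptible
    ("gpt-4", "intimacy"):       [False, False, True, False, False],  # more resistant
}
print(attack_success_rate(results))
# {('gpt-4', 'misinformation'): 0.8, ('gpt-4', 'intimacy'): 0.2}
```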

Automation via Crescendomation

Crescendomation is a tool that automates the multi-turn Crescendo technique. Given a target task, it conducts the escalating conversation on its own, steering the model toward fulfilling the task and thereby jailbreaking its restrictions. Success is assessed with a dual-layer judge, an LLM-based judge complemented by external scoring systems, and cross-checked by manual validation. Under this evaluation, Crescendomation auto-generates Crescendo jailbreaks against a variety of models with high success rates.
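
One plausible shape for such an automation loop is sketched below, with the attacker model, target model, and both judging layers passed in as callables. All of these components are hypothetical stand-ins rather than the paper's implementation; the sketch only captures the described structure of an attacker proposing turns, a refusal-handling step, and a two-stage success check.

```python
# Sketch of an automated Crescendo loop in the spirit of Crescendomation.
# All callables are hypothetical stand-ins supplied by the caller.
from typing import Callable, Optional

def crescendomation(
    task: str,
    attacker: Callable[[str, list[dict]], str],  # proposes the next prompt
    target: Callable[[list[dict]], str],         # model under attack
    judge: Callable[[str, str], bool],           # LLM judge: task fulfilled?
    safety: Callable[[str], bool],               # second layer: output flagged?
    max_turns: int = 10,
) -> Optional[list[dict]]:
    """Return a successful jailbreak transcript, or None within the budget."""
    messages: list[dict] = []
    for _ in range(max_turns):
        # The attacker model plans the next benign-looking turn from the
        # target task and the conversation so far.
        prompt = attacker(task, messages)
        messages.append({"role": "user", "content": prompt})
        reply = target(messages)
        if reply.strip().lower().startswith(("i can't", "i cannot", "sorry")):
            # Crude refusal check: backtrack by dropping the rejected turn
            # so the attacker can try a softer angle next iteration.
            messages.pop()
            continue
        messages.append({"role": "assistant", "content": reply})
        # Dual-layer validation: the judge checks task completion, and a
        # separate safety classifier confirms the output is actually unsafe.
        if judge(task, reply) and safety(reply):
            return messages
    return None
```

Backtracking on refusal means the target never needs to see its own rejection in context, which keeps the conversation on its escalating trajectory.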

Theoretical and Practical Implications

Crescendo's reliance on individually benign inputs marks a significant departure from traditional jailbreak techniques, which typically use direct or overtly adversarial prompts. This matters for defenses: mechanisms predicated on input filtering inspect each prompt in isolation, and in a Crescendo conversation no single turn contains content such a filter would flag. The attack thus deepens our understanding of vulnerabilities intrinsic to LLMs and points to an urgent need for models and moderation pipelines that account for entire conversations rather than individual inputs, in service of more secure and ethically deployed LLMs.

Future Perspectives

The paper opens avenues for future research on both advancing the Crescendo technique and developing novel defenses. Crescendo's adaptive nature suggests it could be refined further to increase its efficacy and stealth. Conversely, the challenge it poses to existing safety mechanisms calls for new approaches to model training and deployment that better align LLMs with ethical guidelines and societal norms.
