AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (2310.04451v2)

Published 3 Oct 2023 in cs.CL and cs.AI

Abstract: The aligned LLMs are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability, and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.

PDF HTML Abstract

AutoDAN: Advancing Stealthy Jailbreak Attacks on Aligned LLMs

The paper presents a robust exploration on the susceptibility of aligned LLMs to jailbreak attacks, specifically focusing on generating stealthy prompts using an innovative approach, AutoDAN. Aligned LLMs are designed to avoid generating harmful or ethically problematic outputs by incorporating extensive human feedback in their training processes. Nevertheless, the potential exists to circumvent these safeguards through carefully constructed prompts, known as jailbreak prompts. These prompts manipulate the model's outputs to bypass its constraints and produce unintended, potentially harmful responses.

The novelty of AutoDAN lies in its ability to automatically generate stealthy jailbreak prompts using a hierarchical genetic algorithm. Traditional methods of creating jailbreak prompts often suffer from issues related to scalability and stealthiness. Manual crafting of prompts is not scalable, and token-based algorithms frequently create prompts that lack semantic congruence, making them easier to detect through basic defenses like perplexity checks.

Core Contributions

Hierarchical Genetic Algorithm: AutoDAN employs a hierarchical genetic algorithm specially designed for dealing with structured discrete data like text prompts. This contrasts with previous methods by focusing on semantic meaningfulness and leveraging the hierarchical nature of language in both sentence and word choices.
Population Initialization and Mutation: The method begins by diversifying the baseline DAN prompts using LLMs, ensuring semantically meaningful variations. This serves as the initial population for the genetic algorithm, introducing necessary diversity without diverging significantly from effective base prompts.
Genetic Operations: AutoDAN incorporates advanced genetic operations, including multi-point crossovers and momentum-based scoring functions, which aid both exploration of new prompt variations and maintenance of semantic integrity.
Evaluation Against Defenses: A significant strength of AutoDAN is its ability to bypass perplexity-based defenses effectively, maintaining a stealthy profile by producing prompts indistinguishable from benign text. This is crucial in establishing a robust attack methodology that can persist against basic defensive strategies.
Transferability and Universality: The paper provides compelling evidence of AutoDAN's ability to transfer across different LLMs, including proprietary models like OpenAI’s GPT-3.5, and demonstrates cross-sample generalization by showing strong applicability to differing malicious queries.

Results and Implications

The AutoDAN framework was evaluated using the AdvBench Harmful Behaviors dataset across several LLMs, yielding a marked improvement in attack success rates over existing methods like GCG. Notably, AutoDAN maintained high effectiveness and stealthiness as reflected in lowered perplexity scores while avoiding detection by keyword-based defenses.

Practical Implications

The strong transferability of AutoDAN's prompts suggests broader vulnerabilities inherent to current LLM architectures. As the research delineates, semantic-based jailbreak prompts represent a cross-model threat vector that could complicate defense mechanisms reliant solely on output evaluation or token-based anomaly detection.

Theoretical Implications

From a theoretical perspective, this paper emphasizes the necessity of reconsidering current model alignment strategies. The capacity for semantic understanding, intrinsic to LLMs, if not adequately bounded, may lead to unanticipated exploit paths like those identified by AutoDAN. Building more robust models might require not only enhancing existing alignment objectives but also integrating newer strategies that encompass a holistic semantic understanding.

Future Directions

The AutoDAN approach invites further exploration into optimization algorithms tailored for textual structures, offering a potential avenue for both attack and defense strategies. Future work might investigate the development of real-time defensive mechanisms that can dynamically adapt to semantic-level adversarial attacks, ensuring more resilient output filtering in LLMs.

In conclusion, AutoDAN signifies a significant methodological advancement in the paper of adversarial attacks on LLMs, revealing tangible pathways to enhance both our understanding and fortification of model alignment processes. While the paper brings to light vulnerabilities that must be addressed, it simultaneously enhances our capability to craft more secure AI systems, paving the way for a safer interaction within AI-driven environments.