Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
38 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (2310.04451v2)

Published 3 Oct 2023 in cs.CL and cs.AI
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Abstract: The aligned LLMs are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability, and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.

AutoDAN: Advancing Stealthy Jailbreak Attacks on Aligned LLMs

The paper presents a robust exploration on the susceptibility of aligned LLMs to jailbreak attacks, specifically focusing on generating stealthy prompts using an innovative approach, AutoDAN. Aligned LLMs are designed to avoid generating harmful or ethically problematic outputs by incorporating extensive human feedback in their training processes. Nevertheless, the potential exists to circumvent these safeguards through carefully constructed prompts, known as jailbreak prompts. These prompts manipulate the model's outputs to bypass its constraints and produce unintended, potentially harmful responses.

The novelty of AutoDAN lies in its ability to automatically generate stealthy jailbreak prompts using a hierarchical genetic algorithm. Traditional methods of creating jailbreak prompts often suffer from issues related to scalability and stealthiness. Manual crafting of prompts is not scalable, and token-based algorithms frequently create prompts that lack semantic congruence, making them easier to detect through basic defenses like perplexity checks.

Core Contributions

  1. Hierarchical Genetic Algorithm: AutoDAN employs a hierarchical genetic algorithm specially designed for dealing with structured discrete data like text prompts. This contrasts with previous methods by focusing on semantic meaningfulness and leveraging the hierarchical nature of language in both sentence and word choices.
  2. Population Initialization and Mutation: The method begins by diversifying the baseline DAN prompts using LLMs, ensuring semantically meaningful variations. This serves as the initial population for the genetic algorithm, introducing necessary diversity without diverging significantly from effective base prompts.
  3. Genetic Operations: AutoDAN incorporates advanced genetic operations, including multi-point crossovers and momentum-based scoring functions, which aid both exploration of new prompt variations and maintenance of semantic integrity.
  4. Evaluation Against Defenses: A significant strength of AutoDAN is its ability to bypass perplexity-based defenses effectively, maintaining a stealthy profile by producing prompts indistinguishable from benign text. This is crucial in establishing a robust attack methodology that can persist against basic defensive strategies.
  5. Transferability and Universality: The paper provides compelling evidence of AutoDAN's ability to transfer across different LLMs, including proprietary models like OpenAI’s GPT-3.5, and demonstrates cross-sample generalization by showing strong applicability to differing malicious queries.

Results and Implications

The AutoDAN framework was evaluated using the AdvBench Harmful Behaviors dataset across several LLMs, yielding a marked improvement in attack success rates over existing methods like GCG. Notably, AutoDAN maintained high effectiveness and stealthiness as reflected in lowered perplexity scores while avoiding detection by keyword-based defenses.

Practical Implications

The strong transferability of AutoDAN's prompts suggests broader vulnerabilities inherent to current LLM architectures. As the research delineates, semantic-based jailbreak prompts represent a cross-model threat vector that could complicate defense mechanisms reliant solely on output evaluation or token-based anomaly detection.

Theoretical Implications

From a theoretical perspective, this paper emphasizes the necessity of reconsidering current model alignment strategies. The capacity for semantic understanding, intrinsic to LLMs, if not adequately bounded, may lead to unanticipated exploit paths like those identified by AutoDAN. Building more robust models might require not only enhancing existing alignment objectives but also integrating newer strategies that encompass a holistic semantic understanding.

Future Directions

The AutoDAN approach invites further exploration into optimization algorithms tailored for textual structures, offering a potential avenue for both attack and defense strategies. Future work might investigate the development of real-time defensive mechanisms that can dynamically adapt to semantic-level adversarial attacks, ensuring more resilient output filtering in LLMs.

In conclusion, AutoDAN signifies a significant methodological advancement in the paper of adversarial attacks on LLMs, revealing tangible pathways to enhance both our understanding and fortification of model alignment processes. While the paper brings to light vulnerabilities that must be addressed, it simultaneously enhances our capability to craft more secure AI systems, paving the way for a safer interaction within AI-driven environments.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (36)
  1. Alex Albert. https://www.jailbreakchat.com/, 2023. Accessed: 2023-09-28.
  2. Dogu Araci. Finbert: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063, 2019.
  3. Promptsource: An integrated development environment and repository for natural language prompts, 2022.
  4. Matt Burgess. The hacking of chatgpt is just getting started. Wired, 2023.
  5. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  6. Jon Christian. Amazing “jailbreak” bypasses chatgpt’s ethics safeguards. Futurism, February, 4:2023, 2023.
  7. Jailbreaker: Automated jailbreak across multiple large language model chatbots, 2023.
  8. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
  9. Understanding dataset difficulty with 𝒱𝒱\mathcal{V}caligraphic_V-usable information. In International Conference on Machine Learning, pp.  5988–6008. PMLR, 2022.
  10. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
  11. Generative language models and automated influence operations: Emerging threats and potential mitigations. arXiv preprint arXiv:2301.04246, 2023.
  12. Alex Havrilla. https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise, 2023. Accessed: 2023-09-28.
  13. Julian Hazell. Large language models can be used to effectively scale spear phishing campaigns. arXiv preprint arXiv:2305.06972, 2023.
  14. Baseline defenses for adversarial attacks against aligned language models, 2023.
  15. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733, 2023.
  16. Open sesame! universal black box jailbreaking of large language models, 2023.
  17. Multi-step jailbreaking privacy attacks on chatgpt, 2023.
  18. Jailbreaking chatgpt via prompt engineering: An empirical study, 2023.
  19. Biogpt: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6), 2022.
  20. Reevaluating adversarial examples in natural language. arXiv preprint arXiv:2004.14174, 2020.
  21. AJ ONeal. https://gist.github.com/coolaj86/6f4f7b30129b0251f61fa7baaa881516, 2023. Accessed: 2023-09-28.
  22. OpenAI. Snapshot of gpt-3.5-turbo from march 1st 2023. https://openai.com/blog/chatgpt, 2023a. Accessed: 2023-08-30.
  23. OpenAI. Gpt-4 technical report, 2023b.
  24. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  25. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.
  26. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022.
  27. Fine-tuning large neural language models for biomedical natural language processing. Patterns, 4(4), 2023.
  28. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  29. walkerspider. https://old.reddit.com/r/ChatGPT/comments/zlcyr9/dan_is_my_new_friend/, 2022. Accessed: 2023-09-28.
  30. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022a.
  31. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705, 2022b.
  32. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
  33. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862, 2021.
  34. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463, 2023.
  35. Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity. arXiv preprint arXiv:2301.12867, pp.  12–2, 2023.
  36. Universal and transferable adversarial attacks on aligned language models, 2023.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Xiaogeng Liu (19 papers)
  2. Nan Xu (83 papers)
  3. Muhao Chen (159 papers)
  4. Chaowei Xiao (110 papers)
Citations (176)