AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Abstract: Aligned large language models (LLMs) are powerful tools for language understanding and decision-making, created through extensive alignment with human feedback. However, these models remain susceptible to jailbreak attacks, in which adversaries manipulate prompts to elicit malicious outputs that aligned LLMs should not produce. Studying jailbreak prompts helps expose the limitations of LLMs and guide efforts to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks rely heavily on manually crafted prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms that generate prompts which are often semantically meaningless and therefore easy to detect with basic perplexity testing. In light of these challenges, we ask: can we develop an approach that automatically generates stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN automatically generates stealthy jailbreak prompts using a carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also exhibits superior attack strength, cross-model transferability, and cross-sample universality compared with the baseline. Moreover, we compare AutoDAN with perplexity-based defense methods and show that AutoDAN bypasses them effectively.
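To make the core idea concrete, below is a minimal, hypothetical sketch of a genetic-algorithm search loop over candidate jailbreak prompts, in the spirit of the hierarchical genetic algorithm the abstract describes. The functions `fitness`, `crossover`, and `mutate`, as well as all hyperparameter values, are illustrative placeholders, not the paper's actual operators, which the abstract does not specify.

```python
# Illustrative genetic-algorithm skeleton for prompt search.
# All operators and hyperparameters below are assumptions for illustration.
import random

def evolve(population, fitness, crossover, mutate,
           generations=100, elite_frac=0.2, mutation_rate=0.1):
    """Evolve a list of candidate prompts toward higher fitness scores."""
    for _ in range(generations):
        # Score and rank candidates; higher fitness means the prompt is
        # closer to eliciting the target (jailbroken) response.
        ranked = sorted(population, key=fitness, reverse=True)
        n_elite = max(2, int(elite_frac * len(ranked)))
        elites = ranked[:n_elite]

        # Refill the population with offspring of the elite candidates.
        offspring = []
        while len(elites) + len(offspring) < len(population):
            p1, p2 = random.sample(elites, 2)
            child = crossover(p1, p2)
            if random.random() < mutation_rate:
                child = mutate(child)
            offspring.append(child)
        population = elites + offspring

    return max(population, key=fitness)
```

Because the crossover and mutation operators in such a scheme work at the level of sentences and words rather than raw tokens, the evolved prompts can remain fluent, which is what distinguishes this style of attack from token-level methods.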
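The abstract also refers to "basic perplexity testing" as a defense against semantically meaningless prompts. Below is a minimal sketch of that idea, assuming a generic off-the-shelf language model (GPT-2 here) as the scorer; the model choice and threshold are assumptions, not values from the paper. Gibberish token-level suffixes tend to score very high perplexity under such a filter, while fluent prompts like AutoDAN's do not, which is why the paper reports bypassing it.

```python
# Minimal sketch of a perplexity-based input filter.
# The scorer model ("gpt2") and threshold (1000.0) are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2: exp of the mean next-token loss."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # next-token cross-entropy loss over the sequence.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def is_suspicious(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity exceeds a (hypothetical) threshold."""
    return perplexity(prompt) > threshold
```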