AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models (2310.15140v2)

Published 23 Oct 2023 in cs.CR, cs.AI, cs.CL, and cs.LG

Abstract: Safety alignment of LLMs can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters; manual jailbreak attacks craft readable prompts, but their limited number due to the necessity of human creativity allows for easy blocking. In this paper, we show that these solutions may be too optimistic. We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types. Guided by the dual goals of jailbreak and readability, AutoDAN optimizes and generates tokens one by one from left to right, resulting in readable prompts that bypass perplexity filters while maintaining high attack success rates. Notably, these prompts, generated from scratch using gradients, are interpretable and diverse, with emerging strategies commonly seen in manual jailbreak attacks. They also generalize to unforeseen harmful behaviors and transfer to black-box LLMs better than their unreadable counterparts when using limited training data or a single proxy model. Furthermore, we show the versatility of AutoDAN by automatically leaking system prompts using a customized objective. Our work offers a new way to red-team LLMs and understand jailbreak mechanisms via interpretability.

Analysis of "AutoDAN: Interpretable Gradient-Based Adversarial Attacks on LLMs"

The paper examines the vulnerability of LLMs to both manual and automatic adversarial attacks, arguing that safety alignment alone is often an inadequate defense. It challenges the assumption that current detection and mitigation strategies are effective by introducing AutoDAN, a novel approach that blends the interpretability and syntactic sophistication of manual attacks with the automated scalability of gradient-based attacks.

Conceptual Framework and Methodology

AutoDAN, short for Automatically Do-Anything-Now, is an interpretable, gradient-based adversarial attack optimized for both readability and effectiveness in compromising LLMs. Unlike its predecessors, which produce unreadable gibberish, AutoDAN generates adversarial sequences that pass perplexity-based filters while retaining human readability and coherence. It operates by generating token sequences iteratively, optimizing one token at a time from left to right, while balancing two core objectives: jailbreaking the model and keeping the sequence fluent and human-readable.

The paper describes a two-stage optimization framework: preliminary selection narrows down a list of candidate tokens by combining gradients of the jailbreak objective with the readability likelihood, and fine selection then refines this shortlist using a weighted combination of the two objectives evaluated exactly. The weight on the jailbreak objective adapts dynamically to the entropy of the next-token distribution, pushing harder where the context leaves the language model less constrained.
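The per-step procedure can be pictured with a short PyTorch sketch. This is a minimal sketch under stated assumptions rather than the paper's reference implementation: the function names (`optimize_next_token`, `generate_suffix`), the hyperparameters (`w_jail`, `prelim_k`, `fine_k`), and the exact entropy-based weighting schedule are illustrative placeholders, and the model is assumed to be a Hugging Face causal LM whose input embeddings are exposed via `get_input_embeddings()`.

```python
import torch
import torch.nn.functional as F

def optimize_next_token(model, prompt_ids, target_ids,
                        w_jail=3.0, prelim_k=512, fine_k=32):
    """One AutoDAN-style step: pick the next adversarial token.

    prompt_ids: 1-D LongTensor (user request plus adversarial tokens so far).
    target_ids: 1-D LongTensor (desired model response, e.g. "Sure, here is ...").
    """
    embed = model.get_input_embeddings().weight           # (vocab, dim)
    for p in model.parameters():                          # we only need gradients
        p.requires_grad_(False)                           # w.r.t. the candidate token

    # Readability objective: log-probability of each vocabulary item
    # as the next token of the current prompt.
    with torch.no_grad():
        next_logits = model(prompt_ids.unsqueeze(0)).logits[0, -1]
        read_logp = F.log_softmax(next_logits, dim=-1)
        # Entropy-adaptive weight (an assumed schedule): weight the jailbreak
        # objective more when the LM is uncertain, i.e. when many readable
        # continuations are available at this position.
        entropy = -(read_logp.exp() * read_logp).sum()
        w = w_jail * entropy / torch.log(torch.tensor(float(embed.size(0))))

    # Preliminary selection: linearize the jailbreak loss with a one-hot
    # gradient, combine it with readability, and keep the top candidates.
    one_hot = torch.zeros(embed.size(0), device=embed.device)
    one_hot[read_logp.argmax()] = 1.0                     # start from a readable guess
    one_hot.requires_grad_(True)
    seq = torch.cat([embed[prompt_ids],
                     (one_hot @ embed).unsqueeze(0),
                     embed[target_ids]]).unsqueeze(0)
    logits = model(inputs_embeds=seq).logits[0]
    n = prompt_ids.numel()
    jail_loss = F.cross_entropy(logits[n:n + target_ids.numel()], target_ids)
    jail_loss.backward()
    prelim_score = w * (-one_hot.grad) + read_logp
    candidates = prelim_score.topk(prelim_k).indices

    # Fine selection: evaluate the exact weighted objective on a shortlist.
    best_tok, best_score = None, float("-inf")
    with torch.no_grad():
        for tok in candidates[:fine_k]:
            ids = torch.cat([prompt_ids, tok.view(1), target_ids]).unsqueeze(0)
            out = model(ids).logits[0]
            loss = F.cross_entropy(out[n:n + target_ids.numel()], target_ids)
            score = (w * (-loss) + read_logp[tok]).item()
            if score > best_score:
                best_tok, best_score = int(tok), score
    return best_tok

def generate_suffix(model, tokenizer, prompt_ids, target_ids, max_new_tokens=32):
    """Grow a readable adversarial suffix one token at a time, left to right."""
    suffix = []
    for _ in range(max_new_tokens):
        tok = optimize_next_token(model, prompt_ids, target_ids)
        suffix.append(tok)
        prompt_ids = torch.cat([prompt_ids,
                                torch.tensor([tok], device=prompt_ids.device)])
    return tokenizer.decode(suffix)
```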

Results and Implications

Empirical results underscore the efficacy of AutoDAN in bypassing existing defenses. AutoDAN achieves high attack success rates against models such as Vicuna-7B, Guanaco, and Pythia-12B, even with perplexity-based defenses in place. The generated prompts are not only successful at jailbreaking but also semantically coherent, so they evade the detection signals that current defenses typically rely on.
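As a concrete picture of the defense being evaded, below is a hedged sketch of a simple perplexity filter of the kind discussed in the paper; the reference model (GPT-2) and the threshold are illustrative assumptions, not the exact filter used in the experiments. Gibberish suffixes from earlier gradient attacks score far above any reasonable threshold, while readable AutoDAN prompts do not.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits
    # Position i predicts token i + 1, so shift logits and labels by one.
    loss = F.cross_entropy(logits[0, :-1], ids[0, 1:])
    return float(torch.exp(loss))

def passes_filter(prompt: str, threshold: float = 200.0) -> bool:
    """Accept only prompts whose perplexity stays below the threshold."""
    return perplexity(prompt) <= threshold
```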

The paper identifies emergent strategies within AutoDAN prompts, such as "Shifting Domains" and "Detailizing Instructions", tactics that naturally align with human-crafted jailbreaks. This reflects an understanding of how LLMs interpret context and emphasizes the need for more robust defenses that account for the nuanced nature of adversarial attacks beyond mere gibberish detection.

Broader Impact and Future Directions

AutoDAN illustrates potential weaknesses in current LLM protection mechanisms and suggests that model creators should explore defense strategies beyond simple filtering and blacklisting of known attack vectors. The paper also demonstrates AutoDAN's utility in tasks such as prompt leaking, arguing for its versatility in probing other vulnerable points within LLM deployments.
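The customization amounts to swapping the optimization target: instead of an affirmative prefix to a harmful request, the target continuation asks the model to reveal its hidden system prompt, and the rest of the token-by-token optimizer is reused unchanged. The target strings and helper below are illustrative assumptions, not the exact objectives from the paper.

```python
# Illustrative targets (assumptions, not the paper's exact strings).
jailbreak_target = "Sure, here is how to"        # harmful-behavior objective
leak_target = "Sure, my system prompt is:"       # prompt-leaking objective

def build_target_ids(tokenizer, target_text):
    """Tokenize the desired continuation used as the optimization target."""
    return tokenizer(target_text, return_tensors="pt",
                     add_special_tokens=False).input_ids[0]

# Reusing the per-step optimizer sketched earlier with a different objective:
# target_ids = build_target_ids(tokenizer, leak_target)
# suffix = generate_suffix(model, tokenizer, prompt_ids, target_ids)
```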

Future Trajectories: Developing more sophisticated defenses, such as equipping LLMs with self-monitoring and deeper contextual understanding, could be a critical enhancement. Moreover, given AutoDAN's adaptability and effectiveness, improving models' handling of complex, multifaceted security scenarios could form a layer of defense beyond classical metrics such as perplexity. These insights could help guide AI safety research toward architectures resilient to evolving adversarial techniques.

This paper encourages a shift in focus toward model robustness against intelligently crafted adversarial inputs, advocating a move away from traditional, reactive security measures toward innovative, preemptive defenses. In summary, AutoDAN marks a significant step in exposing and exploiting gaps in LLM safety, and it is a call to action for strengthening AI security frameworks.

Authors (9)
  1. Sicheng Zhu (15 papers)
  2. Ruiyi Zhang (98 papers)
  3. Bang An (33 papers)
  4. Gang Wu (143 papers)
  5. Joe Barrow (12 papers)
  6. Zichao Wang (34 papers)
  7. Furong Huang (150 papers)
  8. Ani Nenkova (26 papers)
  9. Tong Sun (49 papers)
Citations (30)