Universal and Transferable Adversarial Attacks on Aligned Language Models (2307.15043v2)
Abstract: Because "out-of-the-box" LLMs are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned LLMs to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned LLMs, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/LLM-attacks/LLM-attacks.
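The abstract describes the attack only at a high level: optimize a suffix so that the model assigns high probability to an affirmative target completion, using token gradients to propose substitutions and greedy evaluation to accept them. Below is a minimal sketch of that loop in PyTorch/Hugging Face. It is not the paper's released implementation; the model name ("gpt2"), prompt, target string, and budget hyperparameters are illustrative placeholders chosen only to make the sketch self-contained and cheap to run.

```python
# Minimal sketch of gradient-guided greedy suffix search (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper attacks aligned chat models such as Vicuna
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():          # we only need gradients w.r.t. the suffix tokens
    p.requires_grad_(False)

prompt = "Placeholder harmful request. "   # user query (placeholder)
target = "Sure, here is how"               # affirmative response prefix to maximize
suffix_len, top_k, n_candidates, n_steps = 8, 64, 32, 20  # small budgets for the sketch

prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]
suffix_ids = torch.full((suffix_len,), tok.encode("!")[0], dtype=torch.long)
embed = model.get_input_embeddings()       # (vocab, d) embedding matrix

def target_loss(suffix: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the target tokens given prompt + suffix."""
    ids = torch.cat([prompt_ids, suffix, target_ids]).unsqueeze(0)
    logits = model(ids).logits[0]
    start = len(prompt_ids) + len(suffix)  # position of the first target token
    pred = logits[start - 1 : start - 1 + len(target_ids)]
    return torch.nn.functional.cross_entropy(pred, target_ids)

for step in range(n_steps):
    # 1) gradient of the loss w.r.t. a one-hot relaxation of the suffix tokens
    one_hot = torch.nn.functional.one_hot(suffix_ids, embed.num_embeddings).float()
    one_hot.requires_grad_(True)
    suffix_embeds = one_hot @ embed.weight
    ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    full_embeds = embed(ids).detach().clone()
    full_embeds[0, len(prompt_ids) : len(prompt_ids) + suffix_len] = suffix_embeds
    logits = model(inputs_embeds=full_embeds).logits[0]
    start = len(prompt_ids) + suffix_len
    loss = torch.nn.functional.cross_entropy(
        logits[start - 1 : start - 1 + len(target_ids)], target_ids)
    loss.backward()
    grad = one_hot.grad                    # (suffix_len, vocab)

    # 2) per position, keep the top-k substitutions with the most negative gradient
    candidates = (-grad).topk(top_k, dim=1).indices

    # 3) greedily evaluate random single-token swaps and keep the best one
    best_loss, best_suffix = loss.item(), suffix_ids
    for _ in range(n_candidates):
        pos = torch.randint(suffix_len, (1,)).item()
        new_tok = candidates[pos, torch.randint(top_k, (1,)).item()]
        cand = suffix_ids.clone()
        cand[pos] = new_tok
        with torch.no_grad():
            cand_loss = target_loss(cand).item()
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    suffix_ids = best_suffix
    print(f"step {step}: loss {best_loss:.3f} suffix {tok.decode(suffix_ids)!r}")
```

The paper's actual attack additionally optimizes a single suffix over multiple prompts and multiple models to obtain the universal, transferable behavior reported in the abstract; the single-prompt loop above only illustrates the core gradient-guided greedy substitution step.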