Universal and Transferable Adversarial Attacks on Aligned Language Models (2307.15043v2)
Abstract: Because "out-of-the-box" LLMs are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned LLMs to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned LLMs, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/LLM-attacks/LLM-attacks.
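The abstract describes the attack only at a high level: optimize a suffix so that the model assigns high probability to an affirmative target completion, using token gradients to propose substitutions and greedy evaluation to accept them. Below is a minimal sketch of that loop in PyTorch/Hugging Face. It is not the paper's released implementation; the model name ("gpt2"), prompt, target string, and budget hyperparameters are illustrative placeholders chosen only to make the sketch self-contained and cheap to run.

```python
# Minimal sketch of gradient-guided greedy suffix search (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper attacks aligned chat models such as Vicuna
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():          # we only need gradients w.r.t. the suffix tokens
    p.requires_grad_(False)

prompt = "Placeholder harmful request. "   # user query (placeholder)
target = "Sure, here is how"               # affirmative response prefix to maximize
suffix_len, top_k, n_candidates, n_steps = 8, 64, 32, 20  # small budgets for the sketch

prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]
suffix_ids = torch.full((suffix_len,), tok.encode("!")[0], dtype=torch.long)
embed = model.get_input_embeddings()       # (vocab, d) embedding matrix

def target_loss(suffix: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the target tokens given prompt + suffix."""
    ids = torch.cat([prompt_ids, suffix, target_ids]).unsqueeze(0)
    logits = model(ids).logits[0]
    start = len(prompt_ids) + len(suffix)  # position of the first target token
    pred = logits[start - 1 : start - 1 + len(target_ids)]
    return torch.nn.functional.cross_entropy(pred, target_ids)

for step in range(n_steps):
    # 1) gradient of the loss w.r.t. a one-hot relaxation of the suffix tokens
    one_hot = torch.nn.functional.one_hot(suffix_ids, embed.num_embeddings).float()
    one_hot.requires_grad_(True)
    suffix_embeds = one_hot @ embed.weight
    ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    full_embeds = embed(ids).detach().clone()
    full_embeds[0, len(prompt_ids) : len(prompt_ids) + suffix_len] = suffix_embeds
    logits = model(inputs_embeds=full_embeds).logits[0]
    start = len(prompt_ids) + suffix_len
    loss = torch.nn.functional.cross_entropy(
        logits[start - 1 : start - 1 + len(target_ids)], target_ids)
    loss.backward()
    grad = one_hot.grad                    # (suffix_len, vocab)

    # 2) per position, keep the top-k substitutions with the most negative gradient
    candidates = (-grad).topk(top_k, dim=1).indices

    # 3) greedily evaluate random single-token swaps and keep the best one
    best_loss, best_suffix = loss.item(), suffix_ids
    for _ in range(n_candidates):
        pos = torch.randint(suffix_len, (1,)).item()
        new_tok = candidates[pos, torch.randint(top_k, (1,)).item()]
        cand = suffix_ids.clone()
        cand[pos] = new_tok
        with torch.no_grad():
            cand_loss = target_loss(cand).item()
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    suffix_ids = best_suffix
    print(f"step {step}: loss {best_loss:.3f} suffix {tok.decode(suffix_ids)!r}")
```

The paper's actual attack additionally optimizes a single suffix over multiple prompts and multiple models to obtain the universal, transferable behavior reported in the abstract; the single-prompt loop above only illustrates the core gradient-guided greedy substitution step.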