
Are aligned neural networks adversarially aligned? (2306.15447v2)

Published 26 Jun 2023 in cs.CL, cs.AI, cs.CR, and cs.LG

Abstract: LLMs are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.

Introduction to Aligned Neural Networks

Aligned neural networks are designed to produce outputs consistent with the intentions and ethical standards of their creators. For LLMs, alignment means generating responses that are helpful to user queries while avoiding harmful content. Building LLMs that behave this way has led to techniques such as reinforcement learning from human feedback (RLHF). These efforts aim to keep the models' outputs within the bounds of what is deemed acceptable and to avoid bias or toxicity. Despite these efforts, however, no LLM is entirely safe from being manipulated into producing undesirable outputs through what are known as adversarial examples.

Adversarial Examples: A Challenge to Alignment

Adversarial examples are inputs tailored to trick neural networks into performing actions or generating outputs that they ordinarily would not. Historically, this vulnerability has been explored most extensively in image recognition, where minute changes to an input image, imperceptible to the human eye, can cause a network to misclassify it. Researchers have extended the phenomenon to language, where adversarial inputs can be constructed to coax models into emitting harmful outputs. This raises a critical question: despite advanced alignment techniques, can LLMs maintain their alignment when confronted with adversarially crafted inputs?
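
To make the image-domain setting concrete, here is a minimal sketch of the classic fast gradient sign method (FGSM) in PyTorch. It is illustrative only and is not the paper's attack; the model, image tensor, and label are assumed placeholders.

```python
# Minimal FGSM sketch (illustrative, not the paper's attack).
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=8 / 255):
    """Return an adversarially perturbed copy of `image` after one gradient step."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step in the direction that increases the loss, then clip to the valid pixel range.
    adv = image + epsilon * image.grad.sign()
    return adv.clamp(0.0, 1.0).detach()
```

Even a single step of this kind is often enough to flip an image classifier's prediction while the change remains invisible to a human viewer.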

Evaluating the Robustness of Aligned Models

Recent investigations reveal that while current alignment strategies withstand state-of-the-art text-based adversarial attacks, those attacks are not powerful enough to serve as comprehensive tests of adversarial robustness. In essence, a successful defense against current attacks should not impart false confidence that LLMs remain aligned under all possible adversarial scenarios: even when existing NLP attacks fail, adversarial inputs can still be found by brute force, indicating that our ability to measure robustness accurately remains incomplete.
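
As an illustration of the kind of search such evaluations rely on, below is a hedged sketch of a greedy token-substitution search for an adversarial suffix that makes a forbidden target continuation more likely. This is not the paper's exact procedure; `lm_logprob_of_target` is a hypothetical scoring helper and `vocab` is an assumed list of candidate tokens.

```python
# Hedged sketch of a greedy token-substitution attack (not the paper's procedure).
import random

def greedy_suffix_search(lm_logprob_of_target, prompt, target, vocab,
                         suffix_len=5, iters=500, seed=0):
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = lm_logprob_of_target(prompt + " " + " ".join(suffix), target)
    for _ in range(iters):
        # Propose a random single-token substitution in the suffix.
        pos, tok = rng.randrange(suffix_len), rng.choice(vocab)
        candidate = suffix[:pos] + [tok] + suffix[pos + 1:]
        score = lm_logprob_of_target(prompt + " " + " ".join(candidate), target)
        if score > best:  # keep substitutions that make the target more likely
            suffix, best = candidate, score
    return " ".join(suffix), best
```

Run long enough, this kind of search amounts to the brute-force strategy described in the abstract: it can uncover adversarial inputs even where gradient-based NLP attacks fail.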

The New Frontier: Multimodal Models

The paper emphasizes the shift toward multimodal models, which accept images or other data types alongside text. These models open new avenues for user interaction but also introduce additional vulnerabilities. The research shows that adversarial attacks using perturbed images are especially effective against multimodal systems, inducing them to generate harmful content far more easily than with text alone. Current NLP attacks, by contrast, are not yet strong enough to rigorously test text-only models, exposing a gap in our understanding and motivating the development of stronger attacks to properly evaluate these LLMs.
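
An image-based attack of this kind can be sketched as projected gradient descent on the input pixels so that the model's most likely continuation becomes a chosen target string. The sketch below is a minimal illustration, not the paper's implementation: it assumes a hypothetical differentiable helper `loss_of_target(image, prompt, target)` returning the language-modeling loss of the target, and the perturbation budget and step size are illustrative values.

```python
# Hedged PGD-style sketch against a vision-language model's image input.
import torch

def pgd_image_attack(loss_of_target, image, prompt, target,
                     steps=500, step_size=1 / 255, epsilon=8 / 255):
    orig = image.clone().detach()
    adv = orig.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_of_target(adv, prompt, target)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - step_size * grad.sign()                  # descend: make the target more likely
            adv = orig + (adv - orig).clamp(-epsilon, epsilon)   # stay within the L-inf budget
            adv = adv.clamp(0.0, 1.0)                            # keep valid pixel values
        adv = adv.detach()
    return adv
```

Because the image pathway is continuous and differentiable, this kind of optimization is far easier than searching over discrete tokens, which is consistent with the paper's finding that multimodal models are easily induced to produce un-aligned behavior.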

In conclusion, while the alignment of neural networks signifies progress in pursuing more ethical AI, ensuring their robustness against adversarially designed prompts remains a significant challenge, particularly in multimodal contexts. Future research is urged to focus on refining adversarial attacks for a more accurate assessment of models' abilities to uphold their alignment in all circumstances.

Authors (11)
  1. Nicholas Carlini (101 papers)
  2. Milad Nasr (48 papers)
  3. Christopher A. Choquette-Choo (49 papers)
  4. Matthew Jagielski (51 papers)
  5. Irena Gao (10 papers)
  6. Anas Awadalla (12 papers)
  7. Pang Wei Koh (64 papers)
  8. Daphne Ippolito (47 papers)
  9. Katherine Lee (34 papers)
  10. Ludwig Schmidt (80 papers)
  11. Florian Tramer (19 papers)
Citations (177)