Coercing LLMs to do and reveal (almost) anything (2402.14020v1)

Published 21 Feb 2024 in cs.LG, cs.CL, and cs.CR

Abstract: It has recently been shown that adversarial attacks on LLMs can "jailbreak" the model into making harmful statements. In this work, we argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking. We provide a broad overview of possible attack surfaces and attack goals. Based on a series of concrete examples, we discuss, categorize and systematize attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. We analyze these attacks in controlled experiments, and find that many of them stem from the practice of pre-training LLMs with coding capabilities, as well as the continued existence of strange "glitch" tokens in common LLM vocabularies that should be removed for security reasons.

View on arXiv

Authors (6)

Jonas Geiping (73 papers)
Alex Stein (5 papers)
Manli Shu (23 papers)
Khalid Saifullah (19 papers)
Yuxin Wen (33 papers)
Tom Goldstein (226 papers)

Citations (34)

View on Semantic Scholar

Summary

Coercing LLMs to Reveal and Perform Almost Any Action

Introduction to Adversarial Attacks on LLMs

LLMs are becoming central to various applications, from conversational chatbots to educational assistants. However, the very capacity that enables these models to understand and generate human-like text also exposes them to vulnerabilities. This paper focuses on adversarial attacks that manipulate LLMs into performing unintended actions or revealing sensitive information.

Scope and Methodology of Adversarial Attacks

Adversarial attacks on LLMs are not restricted to extracting hidden information or bypassing content filters. A comprehensive range of attacks includes:

Misdirection attacks, which can make LLMs offer malicious instructions or URLs.
Model control attacks that alter the behavior of LLMs, such as generating endless text to exhaust computational resources.
Denial-of-service (DoS) attacks, designed to impair the functionality of LLMs or the platforms they operate on.
Data extraction attacks, where LLMs divulge information that should remain concealed, like system prompts or sensitive context data.

Common Strategies Employed in Adversarial Attacks

Adversarial tokens used in these attacks reveal certain patterns. Some key strategies include:

Programming and reprogramming, where attackers exploit the LLM's coding knowledge to trick it into executing commands.
Language switching, leveraging LLMs' multilingual capabilities to obfuscate malicious intents.
Role hacking, confusing models about the boundaries between system, user, and model-generated content.
Utilization of glitch tokens, exploiting tokenization flaws to trigger unexpected model responses.
Appeals to authority and calls to action, manipulating the LLM into compliance by mimicking authoritative or urgent requests.

These techniques often mimic or extend the findings from extensive manual red-teaming efforts. Notably, the attacks revealing the usage of glitch tokens in LLM vocabularies underscore critical weaknesses in tokenizer training and model data quality.

The Impact of Attack Strategies on LLM Security

The effectiveness of adversarial attacks poses significant challenges to the deployment of LLMs in real-world scenarios. For example, misdirection attacks can lead users to malicious sites, while DoS attacks could impose financial and operational burdens on service providers. The discovery and execution of these attacks emphasize the need for more rigorous security measures and model training procedures that account for such vulnerabilities.

Defense Mechanisms and Future Considerations

Emerging defense strategies range from detecting anomalies in token patterns to adapting models' responses to minimize damage from potential attacks. Future research must prioritize understanding the breadth of possible attacks and developing robust defenses. This includes scrutinizing the LLM's tokenization process, improving model alignment methodologies, and considering the ethical implications of deploying LLMs in sensitive or critical applications.

Concluding Thoughts

This paper offers a detailed exploration of the adversarial landscape threatening LLMs, expanding the dialogue beyond jailbreak scenarios. By systematically categorizing and analyzing diverse attack methodologies, it sheds light on the multifaceted security challenges facing modern AI systems. As LLMs continue to permeate various sectors, acknowledging and addressing these vulnerabilities becomes paramount to ensure their safe and beneficial application.