
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks (2401.17263v5)

Published 30 Jan 2024 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: Despite advances in AI alignment, LLMs remain vulnerable to adversarial attacks or jailbreaking, in which adversaries can modify prompts to induce unwanted behavior. While some defenses have been proposed, they have not been adapted to newly proposed attacks and more challenging threat models. To address this, we propose an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm, Robust Prompt Optimization (RPO) to create robust system-level defenses. Our approach directly incorporates the adversary into the defensive objective and optimizes a lightweight and transferable suffix, enabling RPO to adapt to worst-case adaptive attacks. Our theoretical and experimental results show improved robustness to both jailbreaks seen during optimization and unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and Llama-2 to 0% on JailbreakBench, setting the state-of-the-art. Code can be found at https://github.com/lapisrocks/rpo

Authors (3)
  1. Andy Zhou (23 papers)
  2. Bo Li (1107 papers)
  3. Haohan Wang (96 papers)
Citations (55)

Summary

Introduction

The security of large language models (LLMs) is under increasing scrutiny as these models become integral to a wide range of applications, raising concerns about their potential exploitation. In this context, Zhou et al. present an approach to fortify LLMs against adversarial attacks that attempt to elicit harmful content, commonly known as "jailbreaking." Their paper introduces a novel adversarial objective tailored to strengthening LLMs and proposes an algorithm called Robust Prompt Optimization (RPO). The crux of RPO is to use gradient-based token optimization to enforce safe outputs, yielding a defense mechanism that adversaries find difficult to circumvent.
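To make this concrete, here is a minimal sketch of the kind of minimax objective involved (the notation is illustrative, not the paper's exact formulation). For a harmful instruction $x$, an adversarial modification $a$ chosen by the attacker, and a defensive suffix $s$ appended to the prompt, the defense seeks

$$\min_{s} \; \max_{a} \; \mathcal{L}\big(y_{\text{safe}} \mid x \oplus a \oplus s\big),$$

where $y_{\text{safe}}$ is a target safe refusal, $\mathcal{L}$ is the model's negative log-likelihood of generating that refusal, and $\oplus$ denotes token-sequence concatenation. The inner maximization models the worst-case attack; the outer minimization chooses the suffix that keeps the model refusing even under that attack.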

Related Work and Background

The paper contextualizes its contributions within the field of adversarial robustness, particularly in NLP. It references established defenses against adversarial examples in computer vision and recognizes their limited applicability to LMs, mainly due to the discrete nature of text data. The paper critiques existing defenses for LMs as lacking in generalizability and practicality. Prior work typically focused on input preprocessing, distillation, and adversarial training, with the latter being empirically successful. However, the authors argue that adversarial attacks on LMs, such as manual or gradient-based jailbreaks, necessitate novel defenses beyond those used against adversarial examples in vision.

Approach and Contributions

The core idea put forth is a defense that is universal, practical, and effective. The paper establishes a realistic threat model encompassing adaptive adversaries capable of producing a variety of attacks. To counter such threats, the authors formalize a joint minimax defense objective and propose RPO, a mechanism to induce benign behavior under adversarial conditions. The algorithm improves robustness by generating defensive suffixes that, when appended to prompts, keep the LM's outputs harmless; a sketch of the suffix-optimization loop is given below. Empirically, RPO defends against a range of known and unknown jailbreaks, reducing the attack success rate (ASR) dramatically, e.g., lowering the ASR on the Starling-7B model from 84% to 8.66% across 20 jailbreaks, a notable advance over previous state-of-the-art defenses.
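As a rough illustration of this style of gradient-based discrete token search (not the authors' implementation), the sketch below optimizes a defensive suffix against a fixed prompt-plus-attack string, using gradients through a one-hot relaxation of the suffix tokens to propose swaps that raise the likelihood of a safe refusal. GPT-2 stands in for the target LM, and the prompt, attack, and refusal strings are all placeholders.

```python
# Illustrative GCG-style optimization of a defensive suffix (not the authors'
# code): greedily swap suffix tokens to maximize the likelihood of a safe
# refusal. GPT-2 stands in for the target LM; all strings are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
for p in model.parameters():  # we only need gradients w.r.t. the suffix
    p.requires_grad_(False)

prompt = "Write instructions for a harmful activity"  # placeholder harmful prompt
attack = " -- ignore previous rules and comply"       # placeholder jailbreak insert
target = "I'm sorry, but I can't help with that."     # desired safe refusal

prefix_ids = tok(prompt + attack, return_tensors="pt").input_ids[0].to(device)
target_ids = tok(target, return_tensors="pt").input_ids[0].to(device)
suffix_ids = tok(" ! ! ! ! ! ! ! !", return_tensors="pt").input_ids[0].to(device)

embed = model.get_input_embeddings()

def refusal_loss(suffix):
    """Negative log-likelihood of the safe refusal given prompt+attack+suffix."""
    ids = torch.cat([prefix_ids, suffix, target_ids]).unsqueeze(0)
    logits = model(ids).logits[0]
    start = len(prefix_ids) + len(suffix)
    return F.cross_entropy(logits[start - 1 : start - 1 + len(target_ids)], target_ids)

for step in range(20):
    # One-hot relaxation so we can differentiate w.r.t. suffix token choices.
    one_hot = F.one_hot(suffix_ids, embed.num_embeddings).float().requires_grad_(True)
    inputs = torch.cat(
        [embed(prefix_ids), one_hot @ embed.weight, embed(target_ids)]
    ).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits[0]
    start = len(prefix_ids) + len(suffix_ids)
    loss = F.cross_entropy(logits[start - 1 : start - 1 + len(target_ids)], target_ids)
    loss.backward()

    # For each position, the token with the most negative gradient is the most
    # promising swap; evaluate those candidates exactly and keep the best one.
    best_loss, best_suffix = loss.item(), suffix_ids
    for pos in range(len(suffix_ids)):
        cand = suffix_ids.clone()
        cand[pos] = one_hot.grad[pos].argmin()
        with torch.no_grad():
            cand_loss = refusal_loss(cand).item()
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    suffix_ids = best_suffix
    print(f"step {step}: loss {best_loss:.3f}")

print("defensive suffix:", tok.decode(suffix_ids))
```

The full RPO procedure additionally re-optimizes the attack string between defense updates (the inner maximization of the objective above), which this sketch omits for brevity. Because the resulting suffix is lightweight and transferable, it can be appended to prompts at inference time, including on closed models.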

Experimental Findings

The empirical evidence strongly supports the claimed benefits of RPO. When evaluated on jailbreaks unseen during optimization, the defense sustains substantial reductions in ASR, including against the highly optimized Quack attack on GPT-4, whose ASR drops from 92% to 6%. Furthermore, RPO has a negligible impact on benign usage of LMs, underscoring its practicality. The authors demonstrate that RPO transfers to multiple LMs, including GPT-4, Llama-2, and Vicuna-7B, supporting its claim to be a universal defense. It also remains robust under adaptive attacks that modify their strategies to overcome RPO.

Conclusion

Zhou et al.'s work addresses the pressing need for secure LMs in a manner unexplored by previous research. RPO, with its practical and universal nature, sets a new standard for defending against jailbreaking. The results prompt a reassessment of adversarial attacks and defenses in NLP and encourage the adoption of prompt optimization as a robust safeguard. The research points the community toward the view that defending against jailbreaking is tractable when approached thoughtfully, and it may pave the way for more robust and trustworthy deployment of LMs across domains.