Introduction
The security of language models (LMs) is under increasing scrutiny as they become integral to a growing range of applications, raising concerns about their potential exploitation. In this context, Zhou et al. present an innovative approach to fortifying LMs against adversarial attacks that attempt to elicit harmful content, commonly known as "jailbreaking." Their paper introduces an adversarial objective tailored to defending LMs and proposes an algorithm called robust prompt optimization (RPO). The crux of RPO is to use gradient-based token optimization to enforce safe outputs from LMs, acting as a defense mechanism that adversaries find difficult to circumvent.
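To make the mechanism concrete, the following is a minimal sketch (not the authors' code) of how gradient-based token optimization of a defensive suffix can be set up: the suffix tokens are relaxed to one-hot vectors so that gradients of the loss on a safe target response can be computed and used to rank candidate token substitutions. The model name, function name, and overall structure are illustrative assumptions rather than details taken from the paper.

# Illustrative sketch only (assumed setup, not the paper's implementation):
# gradient of a safe-target loss with respect to one-hot suffix tokens,
# the core signal behind greedy coordinate-style token optimization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; RPO targets chat-tuned LMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
embed = model.get_input_embeddings()  # (vocab_size, d_model) lookup table

def suffix_gradients(prompt_ids, suffix_ids, target_ids):
    """Gradient of the loss on a safe target response w.r.t. the one-hot
    encoding of each defensive-suffix token (all args: 1-D LongTensors)."""
    one_hot = torch.zeros(len(suffix_ids), embed.num_embeddings,
                          dtype=embed.weight.dtype)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)
    suffix_embeds = one_hot @ embed.weight  # differentiable embedding lookup
    full_embeds = torch.cat([embed(prompt_ids), suffix_embeds,
                             embed(target_ids)], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=full_embeds).logits[0]
    # Score only the positions that should produce the safe target.
    tgt_start = len(prompt_ids) + len(suffix_ids)
    loss = torch.nn.functional.cross_entropy(
        logits[tgt_start - 1:-1], target_ids)
    loss.backward()
    return one_hot.grad  # large negative entries suggest promising swaps

In a greedy coordinate-gradient loop, one would take the most promising substitutions per suffix position from these gradients, evaluate the true loss of each candidate suffix, and keep the best; the full method, as described later in this review, optimizes the suffix against a worst case over a set of jailbreaks rather than a single fixed prompt.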
Related Work and Background
The paper contextualizes its contributions within the field of adversarial robustness, particularly in NLP. It references established defenses against adversarial examples in computer vision and recognizes their limited applicability to LMs, mainly due to the discrete nature of text data. The paper critiques existing defenses for LMs as lacking in generalizability and practicality. Prior work typically focused on input preprocessing, distillation, and adversarial training, with the latter being empirically successful. However, the authors argue that adversarial attacks on LMs, such as manual or gradient-based jailbreaks, necessitate novel defenses beyond those used against adversarial examples in vision.
Approach and Contributions
The core idea put forth is a defense that is universal, practical, and effective. The paper establishes a realistic threat model encompassing adaptive adversaries capable of producing a variety of attacks. To counter such threats, the authors formalize a joint minimax defense objective and propose RPO, a mechanism for inducing benign behavior under adversarial conditions. The algorithm improves robustness by generating defensive suffixes that, when appended to prompts, keep the LM's outputs harmless. Empirically, RPO defends effectively against both known and previously unseen jailbreaks. It reduces the attack success rate (ASR) dramatically, e.g., lowering the ASR on the Starling-7B model from 84% to 8.66% across 20 jailbreaks, a notable advance over previous state-of-the-art defenses.
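One way to write such a joint minimax defense objective, in notation introduced here for illustration (it may not match the paper's exact formulation), is

\[
  \min_{s}\; \max_{a \in \mathcal{A}}\;
  \mathcal{L}\bigl(y_{\text{safe}} \mid a(x) \oplus s\bigr),
\]

where $x$ is a harmful request, $a(x)$ is its jailbroken version drawn from an attack family $\mathcal{A}$, $\oplus$ denotes appending the defensive suffix $s$, and $\mathcal{L}$ is the language-modeling loss of producing a safe response $y_{\text{safe}}$. The inner maximization models the adaptive adversary; the outer minimization is what the gradient-based token optimization approximates.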
Experimental Findings
The empirical evidence strongly supports the claimed benefits of RPO. When evaluated on jailbreaks unseen during optimization, the defense sustains a substantial reduction in ASR, including against the highly optimized Quack attack on GPT-4, whose ASR drops from 92% to 6%. Furthermore, RPO has a negligible impact on benign usage of LMs, underscoring its practicality. The authors demonstrate that RPO transfers to multiple LMs, including GPT-4, Llama-2, and Vicuna-7B, supporting its claim to be a universal defense. It also remains robust under adaptive attacks that modify their strategies specifically to overcome RPO.
Conclusion
Zhou et al.'s work addresses the pressing need for secure LMs in a manner unexplored by previous research. RPO, with its practical and universal nature, sets a new standard for defending against jailbreaking. This work prompts a reassessment of adversarial attacks and defenses in NLP and encourages the adoption of prompt optimization strategies as a robust safeguard. It pivots the community toward the view that defending against jailbreaking is tractable when approached thoughtfully, and it may pave the way for more robust and trustworthy deployment of LMs across various domains.