PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails (2402.15911v1)

Published 24 Feb 2024 in cs.CR and cs.CL

Abstract: LLMs are typically aligned to be harmless to humans. Unfortunately, recent work has shown that such models are susceptible to automated jailbreak attacks that induce them to generate harmful content. More recent LLMs often incorporate an additional layer of defense, a Guard Model, which is a second LLM that is designed to check and moderate the output response of the primary LLM. Our key contribution is to show a novel attack strategy, PRP, that is successful against several open-source (e.g., Llama 2) and closed-source (e.g., GPT 3.5) implementations of Guard Models. PRP leverages a two step prefix-based attack that operates by (a) constructing a universal adversarial prefix for the Guard Model, and (b) propagating this prefix to the response. We find that this procedure is effective across multiple threat models, including ones in which the adversary has no access to the Guard Model at all. Our work suggests that further advances are required on defenses and Guard Models before they can be considered effective.

Propagating Universal Perturbations to Overcome LLM Guard Models

Introduction to the Attack on Guard Models

The deployment of LLMs in real-world applications requires robust mechanisms to ensure their safe interaction with users. An increasingly common strategy is to add a secondary reviewing LLM, dubbed a Guard Model, that moderates the primary LLM's output and filters out harmful content. The paper "PRP: Propagating Universal Perturbations to Attack LLM Guard-Rails" presents a methodical approach to circumventing these guardrails, challenging the reliability of this defense mechanism.
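To make this setup concrete, the following is a minimal sketch of a Guard-Railed pipeline in which a base LLM generates a response and a Guard Model decides whether it may be released. The function names, the guard prompt template, and the refusal message are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a Guard-Railed LLM pipeline: a base LLM generates a
# response and a second "Guard Model" decides whether it may be shown to
# the user. The prompt template and refusal text are illustrative only.
from typing import Callable

GUARD_TEMPLATE = (
    "You are a strict content moderator. Answer only 'harmful' or 'harmless'.\n\n"
    "Response to evaluate:\n{response}"
)

def guard_railed_generate(
    user_prompt: str,
    base_llm: Callable[[str], str],   # call into the primary LLM
    guard_llm: Callable[[str], str],  # call into the Guard Model
    refusal: str = "I'm sorry, I can't help with that.",
) -> str:
    """Generate with the base LLM, then filter the output through the guard."""
    response = base_llm(user_prompt)
    verdict = guard_llm(GUARD_TEMPLATE.format(response=response))
    # Release the response only if the guard labels it harmless.
    return response if "harmless" in verdict.lower() else refusal

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without real model APIs.
    toy_base = lambda prompt: f"(base model answer to: {prompt})"
    toy_guard = lambda prompt: "harmless"
    print(guard_railed_generate("Tell me a joke.", toy_base, toy_guard))
```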

Attack Mechanism Overview

The paper describes a two-step attack strategy, named PRP, that constructs and leverages universal adversarial prefixes to deceive Guard Models. First, it identifies a universal adversarial prefix that, when prepended to a harmful response, masks the harmfulness of the content from the Guard Model and thereby evades detection. Second, the attack exploits the in-context learning capabilities of the base LLM to ensure that its response begins with this universal adversarial prefix. Together, these steps allow harmful responses to slip past the Guard Model's scrutiny; a schematic sketch of both steps is given below.
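The sketch below illustrates the shape of these two steps under simplifying assumptions: a greedy random search stands in for the paper's actual prefix optimization, and the scoring function, prompt template, and placeholder examples are hypothetical rather than the paper's own.

```python
# Schematic sketch of the two PRP steps described above. The search
# procedure, scoring function, and prompt template are simplified
# placeholders; the paper's actual optimization and wording differ.
import random
from typing import Callable, List

def find_universal_prefix(
    guard_score: Callable[[str], float],   # higher = guard judges text more harmful
    candidate_tokens: List[str],
    seed_responses: List[str],
    prefix_len: int = 10,
    iterations: int = 200,
) -> str:
    """Greedy random search for a prefix that lowers the guard's average
    'harmful' score across a set of seed responses (hence 'universal')."""
    prefix = random.choices(candidate_tokens, k=prefix_len)

    def avg_score(tokens: List[str]) -> float:
        text = " ".join(tokens)
        return sum(guard_score(text + " " + r) for r in seed_responses) / len(seed_responses)

    best = avg_score(prefix)
    for _ in range(iterations):
        trial = prefix.copy()
        trial[random.randrange(prefix_len)] = random.choice(candidate_tokens)
        score = avg_score(trial)
        if score < best:   # keep changes that make the guard less suspicious overall
            prefix, best = trial, score
    return " ".join(prefix)

def propagation_prompt(universal_prefix: str, request: str, k: int = 3) -> str:
    """Build a prompt whose in-context examples all begin their answers with
    the universal prefix, nudging the base LLM to start its own response with
    it as well (the 'propagation' step)."""
    demos = "\n".join(
        f"Q: <example question {i}>\nA: {universal_prefix} <example answer {i}>"
        for i in range(1, k + 1)
    )
    return f"{demos}\nQ: {request}\nA:"
```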

Evaluation Results

Applying PRP across a range of Guard-Railed LLM configurations, both open- and closed-source, demonstrates its effectiveness. Notably, the experiments show that PRP achieves an 80% jailbreak success rate on configurations that use open-source models such as Llama 2 and closed-source models such as GPT 3.5 as Guard Models. This contrasts sharply with the substantially lower success rates of conventional attacks, highlighting the vulnerability of Guard Models to the proposed strategy. A sketch of how such a success rate could be measured follows.
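For context on how a number like this might be measured, the snippet below sketches one way to compute a jailbreak success rate over a guard-railed pipeline. The refusal check used as the success criterion is an assumption for illustration, not the paper's exact evaluation protocol.

```python
# Hedged sketch of a jailbreak success-rate computation: run each attack
# prompt through the guard-railed pipeline and count outputs that are not
# refusals. The refusal heuristic is a simplification.
from typing import Callable, List

def jailbreak_success_rate(
    attack_prompts: List[str],
    pipeline: Callable[[str], str],  # e.g. guard_railed_generate with models bound
    is_refusal: Callable[[str], bool] = lambda r: r.lower().startswith("i'm sorry"),
) -> float:
    """Fraction of attack prompts whose output is not judged a refusal."""
    outputs = [pipeline(p) for p in attack_prompts]
    successes = sum(not is_refusal(o) for o in outputs)
    return successes / max(len(attack_prompts), 1)
```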

Implications and Theoretical Insights

The paper raises critical concerns about the current state of Guard-Railed LLMs and casts doubt on the effectiveness of Guard Models as reliable defense mechanisms against sophisticated attacks. The findings stress the need for more advanced and perhaps fundamentally different approaches to ensure the safe deployment of LLMs in sensitive and interactive applications. Moreover, the paper underscores the significance of continuous and dynamic security assessments for LLMs, advocating for a shift towards developing adaptive and resilient defense strategies.

Future Landscape of AI Safety

The disclosure of PRP prompts an important discussion about strengthening Guard Models against adversarial manipulation. It sets the stage for further research into more robust guardrails for LLMs, potentially drawing on adversarial training, differential privacy, or other AI safety techniques. The paper also motivates exploring alternative safety paradigms that do not rely solely on post-hoc moderation by another LLM but instead integrate safety principles more directly into the model's architecture or training process.

Closing Thoughts

The unveiling of PRP as a potent method to undermine Guard-Railed LLMs poses crucial questions about the sufficiency and robustness of current LLM safety mechanisms. This investigation not only serves as a clarion call for the AI research community to prioritize the development of more reliable defenses but also enriches our understanding of the vulnerabilities inherent to LLMs. As we venture further into deploying LLMs across various domains, ensuring their safe and ethical use remains an indispensable goal that demands concerted and innovative efforts.

Authors (7)
  1. Neal Mangaokar (11 papers)
  2. Ashish Hooda (14 papers)
  3. Jihye Choi (13 papers)
  4. Shreyas Chandrashekaran (1 paper)
  5. Kassem Fawaz (41 papers)
  6. Somesh Jha (112 papers)
  7. Atul Prakash (36 papers)
Citations (24)