Certifying LLM Safety against Adversarial Prompting
LLMs have demonstrated impressive capabilities but remain susceptible to adversarial attacks, one of the most concerning being adversarial prompting: manipulating an input prompt by adding malicious tokens designed to bypass the model's safety mechanisms and provoke harmful responses. The paper "Certifying LLM Safety against Adversarial Prompting" introduces a framework, termed "erase-and-check", that defends LLMs against adversarial prompts with certifiable safety guarantees.
Framework Overview
The proposed "erase-and-check" procedure addresses the vulnerability of LLMs to adversarial prompting by systematically erasing tokens from an input prompt and inspecting each resulting subsequence, along with the original prompt, with a safety filter. If the filter flags the prompt or any of its erased subsequences as harmful, the input is labeled harmful. Because erasing the adversarial tokens (up to the certified size limit) recovers the original prompt, a harmful prompt that the filter detects cannot be disguised by adversarial modifications within that limit.
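As a minimal sketch of the suffix-mode procedure, assuming a black-box `is_harmful` safety filter and a token list as input (names and interface are illustrative, not the authors' implementation):

```python
def erase_and_check_suffix(prompt_tokens, is_harmful, max_erase):
    """Label a prompt harmful if the prompt itself, or any version with up to
    `max_erase` trailing tokens erased, is flagged by the safety filter."""
    # Check the original prompt first.
    if is_harmful(prompt_tokens):
        return True
    # Erase 1, 2, ..., max_erase tokens from the end and re-check each prefix.
    for i in range(1, min(max_erase, len(prompt_tokens)) + 1):
        if is_harmful(prompt_tokens[:-i]):
            return True
    return False
```

The certificate for this mode follows directly: if a harmful prompt is detected by the filter, then for any appended suffix of at most `max_erase` tokens, one of the checked prefixes is exactly the original harmful prompt, so the combined procedure still flags it.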
The safety filter is implemented in two ways, one based on Llama 2 and one based on a fine-tuned DistilBERT classifier, which trade off detection quality against computational cost differently.
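As a rough illustration of the DistilBERT-style filter, the sketch below wraps a binary sequence classifier from Hugging Face transformers. The checkpoint name and label convention are assumptions of this sketch; the paper's fine-tuned weights are not reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: in practice this would be a DistilBERT model
# fine-tuned to classify prompts as safe vs. harmful.
MODEL_NAME = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def is_harmful(prompt: str) -> bool:
    """Return True if the classifier assigns the 'harmful' class.
    The label convention (0 = safe, 1 = harmful) is an assumption of this sketch."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 1
```

Plugged into the `erase_and_check_suffix` sketch above, this callable plays the role of the safety filter; the same interface would accommodate a Llama 2-based filter, just at higher inference cost.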
Attack Modes and Evaluation
The paper evaluates three adversarial attack modes:
- Adversarial Suffix: an adversarial sequence is appended to the end of the prompt. Erase-and-check was shown to detect over 92% of harmful prompts in this setting, guaranteeing that an appended suffix within the certified length cannot evade detection.
- Adversarial Insertion: an adversarial sequence is inserted at an arbitrary location within the prompt. This mode is harder to defend against because many more subsequences must be checked, yet erase-and-check remains robust while staying computationally feasible.
- Adversarial Infusion: adversarial tokens are inserted at arbitrary positions throughout the prompt, greatly increasing both the difficulty of the defense and its computational cost (the counting sketch after this list makes the gap concrete). Even under this threat model, the procedure still offers solid guarantees on harmful prompt detection.
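To make the cost difference across the three modes concrete, the sketch below counts how many erased subsequences must be checked for a prompt of length n and an adversarial budget of d tokens. This is a back-of-the-envelope count under the assumptions stated in the comments (a single contiguous inserted block for insertion mode), not the paper's exact bookkeeping.

```python
from math import comb

def num_checks_suffix(n: int, d: int) -> int:
    # Erase 0, 1, ..., d trailing tokens (0 = the original prompt).
    return min(d, n) + 1

def num_checks_insertion(n: int, d: int) -> int:
    # Erase every contiguous block of 1..d tokens, plus the original prompt.
    return 1 + sum(n - length + 1 for length in range(1, min(d, n) + 1))

def num_checks_infusion(n: int, d: int) -> int:
    # Erase every subset of up to d tokens, including the empty subset (original prompt).
    return sum(comb(n, k) for k in range(0, min(d, n) + 1))

if __name__ == "__main__":
    n, d = 20, 5   # example: a 20-token prompt with an adversarial budget of 5 tokens
    print(num_checks_suffix(n, d))     # 6
    print(num_checks_insertion(n, d))  # 91
    print(num_checks_infusion(n, d))   # 21700
```

The combinatorial blow-up from suffix to infusion is what drives the computational concerns discussed next.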
The efficacy of the framework depends heavily on the accuracy of the underlying safety filter; in the reported experiments, the DistilBERT filter significantly outperforms Llama 2 in both speed and detection precision.
Empirical Defenses
To complement the certified defense, three empirical variants were designed to improve computational efficiency:
- RandEC: a randomized variant of erase-and-check that inspects only a random subset of the erased subsequences, improving speed while maintaining high detection rates against adversarially modified harmful prompts (a minimal sketch appears after this list).
- GreedyEC: a greedy variant that erases, one at a time, the tokens that maximize the safety filter's softmax score for the harmful class.
- GradEC: a gradient-based variant that uses gradients of the safety filter to identify which tokens to erase, improving harmful prompt detection.
These empirical methods show considerable promise in balancing runtime against detection accuracy and can be adapted to multiple threat scenarios.
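As one concrete example, here is a minimal sketch of the randomized variant (RandEC) for the suffix mode, reusing the `is_harmful` filter from above; the sampling fraction and its handling are assumptions of this sketch rather than the paper's exact algorithm.

```python
import random

def rand_ec_suffix(prompt_tokens, is_harmful, max_erase, sample_frac=0.3):
    """Randomized erase-and-check (suffix mode): check the original prompt plus a
    random fraction of the erased prefixes instead of all of them."""
    if is_harmful(prompt_tokens):
        return True
    candidates = list(range(1, min(max_erase, len(prompt_tokens)) + 1))
    if not candidates:
        return False
    # Sample a fraction of the erasure lengths to check (at least one).
    k = max(1, int(sample_frac * len(candidates)))
    for i in random.sample(candidates, k):
        if is_harmful(prompt_tokens[:-i]):
            return True
    return False
```

Because only a subset of subsequences is checked, this variant trades the certified guarantee for speed, which is why it is presented as an empirical rather than a certified defense.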
Implications and Future Work
The "erase-and-check" framework sets a substantial precedent for certifying LLM safety amidst adversarial attacks. The theoretical and empirical success in certifying models against specific attack sizes not only emphasizes potential practical applications but also demonstrates a scalable path toward more robust future AI governance mechanisms.
Future developments may focus on broadening the spectrum of certified safety mechanisms in LLMs, exploring general threat models beyond simple prompt modifications, and seeking efficient algorithms that can further enhance the practicality of adversarial defenses.
Overall, the paper establishes a critical foundation in the effort to assure LLM safety, prompting further research into extending such certified defenses to encompass even more complex models and adversarial conditions.