Certifying LLM Safety against Adversarial Prompting
LLMs have demonstrated impressive capabilities but remain susceptible to adversarial attacks, one of the most concerning being adversarial prompting: manipulating an input prompt by adding malicious tokens designed to bypass the model's safety mechanisms and provoke harmful responses. The paper "Certifying LLM Safety against Adversarial Prompting" introduces a framework, termed "erase-and-check", that defends LLMs against adversarial prompts with certifiable safety guarantees.
Framework Overview
The proposed "erase-and-check" procedure addresses the vulnerability of LLMs to adversarial prompting by systematically erasing tokens from an input prompt and inspecting each resulting subsequence, along with the original prompt, with a safety filter. If the filter flags the prompt or any of its erased subsequences as harmful, the input is labeled harmful. Because erasing the adversarial tokens (up to the certified size limit) recovers the original prompt, a harmful prompt that the filter detects cannot be disguised by adversarial modifications within that limit.
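As a minimal sketch of the suffix-mode procedure, assuming a black-box `is_harmful` safety filter and a token list as input (names and interface are illustrative, not the authors' implementation):

```python
def erase_and_check_suffix(prompt_tokens, is_harmful, max_erase):
    """Label a prompt harmful if the prompt itself, or any version with up to
    `max_erase` trailing tokens erased, is flagged by the safety filter."""
    # Check the original prompt first.
    if is_harmful(prompt_tokens):
        return True
    # Erase 1, 2, ..., max_erase tokens from the end and re-check each prefix.
    for i in range(1, min(max_erase, len(prompt_tokens)) + 1):
        if is_harmful(prompt_tokens[:-i]):
            return True
    return False
```

The certificate for this mode follows directly: if a harmful prompt is detected by the filter, then for any appended suffix of at most `max_erase` tokens, one of the checked prefixes is exactly the original harmful prompt, so the combined procedure still flags it.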
The safety filter is implemented in two ways, one based on Llama 2 and one based on a fine-tuned DistilBERT classifier, which trade off detection quality against computational cost differently.
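As a rough illustration of the DistilBERT-style filter, the sketch below wraps a binary sequence classifier from Hugging Face transformers. The checkpoint name and label convention are assumptions of this sketch; the paper's fine-tuned weights are not reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: in practice this would be a DistilBERT model
# fine-tuned to classify prompts as safe vs. harmful.
MODEL_NAME = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def is_harmful(prompt: str) -> bool:
    """Return True if the classifier assigns the 'harmful' class.
    The label convention (0 = safe, 1 = harmful) is an assumption of this sketch."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 1
```

Plugged into the `erase_and_check_suffix` sketch above, this callable plays the role of the safety filter; the same interface would accommodate a Llama 2-based filter, just at higher inference cost.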
Attack Modes and Evaluation
The paper evaluates three adversarial attack modes:
- Adversarial Suffix: an adversarial sequence is appended to the end of the prompt. Erase-and-check was shown to detect over 92% of harmful prompts in this setting, guaranteeing that an appended suffix within the certified length cannot evade detection.
- Adversarial Insertion: an adversarial sequence is inserted at an arbitrary location within the prompt. This mode is harder to defend against because many more subsequences must be checked, yet erase-and-check remains robust while staying computationally feasible.
- Adversarial Infusion: adversarial tokens are inserted at arbitrary positions throughout the prompt, greatly increasing both the difficulty of the defense and its computational cost (the counting sketch after this list makes the gap concrete). Even under this threat model, the procedure still offers solid guarantees on harmful prompt detection.
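To make the cost difference across the three modes concrete, the sketch below counts how many erased subsequences must be checked for a prompt of length n and an adversarial budget of d tokens. This is a back-of-the-envelope count under the assumptions stated in the comments (a single contiguous inserted block for insertion mode), not the paper's exact bookkeeping.

```python
from math import comb

def num_checks_suffix(n: int, d: int) -> int:
    # Erase 0, 1, ..., d trailing tokens (0 = the original prompt).
    return min(d, n) + 1

def num_checks_insertion(n: int, d: int) -> int:
    # Erase every contiguous block of 1..d tokens, plus the original prompt.
    return 1 + sum(n - length + 1 for length in range(1, min(d, n) + 1))

def num_checks_infusion(n: int, d: int) -> int:
    # Erase every subset of up to d tokens, including the empty subset (original prompt).
    return sum(comb(n, k) for k in range(0, min(d, n) + 1))

if __name__ == "__main__":
    n, d = 20, 5   # example: a 20-token prompt with an adversarial budget of 5 tokens
    print(num_checks_suffix(n, d))     # 6
    print(num_checks_insertion(n, d))  # 91
    print(num_checks_infusion(n, d))   # 21700
```

The combinatorial blow-up from suffix to infusion is what drives the computational concerns discussed next.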
The efficacy of the framework depends heavily on the accuracy of the underlying safety filter; in the reported experiments, the DistilBERT filter significantly outperforms Llama 2 in both speed and detection precision.
Empirical Defenses
To complement the certified defense, three empirical variants were designed to improve computational efficiency:
- RandEC: a randomized variant of erase-and-check that inspects only a random subset of the erased subsequences, improving speed while maintaining high detection rates against adversarially modified harmful prompts (a minimal sketch appears after this list).
- GreedyEC: a greedy variant that erases, one at a time, the tokens that maximize the safety filter's softmax score for the harmful class.
- GradEC: a gradient-based variant that uses gradients of the safety filter to identify which tokens to erase, improving harmful prompt detection.
These empirical methods show considerable promise in balancing runtime against detection accuracy and can be adapted to multiple threat scenarios.
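As one concrete example, here is a minimal sketch of the randomized variant (RandEC) for the suffix mode, reusing the `is_harmful` filter from above; the sampling fraction and its handling are assumptions of this sketch rather than the paper's exact algorithm.

```python
import random

def rand_ec_suffix(prompt_tokens, is_harmful, max_erase, sample_frac=0.3):
    """Randomized erase-and-check (suffix mode): check the original prompt plus a
    random fraction of the erased prefixes instead of all of them."""
    if is_harmful(prompt_tokens):
        return True
    candidates = list(range(1, min(max_erase, len(prompt_tokens)) + 1))
    if not candidates:
        return False
    # Sample a fraction of the erasure lengths to check (at least one).
    k = max(1, int(sample_frac * len(candidates)))
    for i in random.sample(candidates, k):
        if is_harmful(prompt_tokens[:-i]):
            return True
    return False
```

Because only a subset of subsequences is checked, this variant trades the certified guarantee for speed, which is why it is presented as an empirical rather than a certified defense.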
Implications and Future Work
The "erase-and-check" framework sets a substantial precedent for certifying LLM safety amidst adversarial attacks. The theoretical and empirical success in certifying models against specific attack sizes not only emphasizes potential practical applications but also demonstrates a scalable path toward more robust future AI governance mechanisms.
Future developments may focus on broadening the spectrum of certified safety mechanisms in LLMs, exploring general threat models beyond simple prompt modifications, and seeking efficient algorithms that can further enhance the practicality of adversarial defenses.
Overall, the paper establishes a critical foundation in the effort to assure LLM safety, prompting further research into extending such certified defenses to encompass even more complex models and adversarial conditions.