Formal Summary of the Paper: Adversarial Suffix Filtering: A Defense Pipeline for LLMs
David Khachaturov and Robert Mullins of the University of Cambridge present a computationally efficient strategy for safeguarding LLMs against adversarial suffix attacks. Motivated by the widespread deployment of LLMs in public-facing and autonomous systems, the paper addresses the security vulnerability posed by adversarial suffixes, an advanced form of prompt-injection attack. The authors introduce Adversarial Suffix Filtering (ASF), a model-agnostic defensive pipeline that mitigates these attacks without modifying the protected model.
Key Contributions
- Adversarial Suffix Filtering (ASF): The central contribution is the ASF pipeline, an input preprocessor and sanitizer that detects and strips adversarially crafted suffixes before they reach the model. By removing the suffix before it can steer the LLM's output, ASF blunts what is currently the state-of-the-art method for jailbreaking LLMs (a minimal sketch of this preprocessing flow follows the list).
- Comprehensive Defense Capability: ASF reduces the attack success rate of adversarial suffixes to below 4% in both black-box and white-box settings while preserving the LLM's performance on non-adversarial inputs. Because it requires neither access to nor modification of the protected model's internals, it remains applicable across different deployment environments.
- Lightweight and Model-Agnostic: Unlike existing defenses that require extensive retraining or can be bypassed with simple prompt engineering, ASF is lightweight in both memory and compute, making it feasible to deploy even in resource-constrained settings.
- Experimental Validation and Deployment Feasibility: The evaluation uses benchmark datasets such as MaliciousInstruct and AdvBench and spans several LLM architectures, providing strong evidence of ASF's effectiveness while showing no degradation of base-model accuracy on typical non-adversarial tasks.
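The preprocessing flow can be pictured with a short sketch. The snippet below is illustrative only: `looks_like_adversarial_suffix`, the token-window size, and the regex heuristic are assumptions standing in for the paper's trained detector. It shows the general sanitizer control flow (inspect the prompt tail, strip a flagged suffix, forward the rest to the unmodified model), not the authors' actual implementation.

```python
# Minimal sketch of an ASF-style prompt sanitizer. The paper's detector is a
# trained classifier; a simple token-level heuristic stands in for it here,
# purely to illustrate the preprocessor/sanitizer control flow.

import re

def looks_like_adversarial_suffix(segment: str) -> bool:
    """Stand-in detector (assumption, not the paper's method): flags segments
    dominated by non-word-like tokens, since suffixes produced by gradient-based
    search tend to be high-entropy strings of rare tokens and punctuation."""
    tokens = segment.split()
    if not tokens:
        return False
    odd = sum(1 for t in tokens if not re.fullmatch(r"[A-Za-z]+[.,!?]?", t))
    return odd / len(tokens) > 0.5  # threshold is illustrative, not from the paper

def sanitize_prompt(prompt: str, window: int = 20) -> str:
    """Drop a trailing window of tokens if it is flagged as an adversarial
    suffix; otherwise return the prompt unchanged. The window size is an
    illustrative parameter, not a value reported in the paper."""
    tokens = prompt.split()
    head, tail = tokens[:-window], tokens[-window:]
    if looks_like_adversarial_suffix(" ".join(tail)):
        return " ".join(head)
    return prompt

# The sanitized prompt is then forwarded to the (unmodified) target LLM, e.g.:
#   response = target_llm(sanitize_prompt(user_prompt))
```

Because the filtering happens entirely on the input side, the same sketch applies whether the downstream model is open-weight or a closed commercial API, which is what makes the approach model-agnostic.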
Practical Implications and Future Directions
The work has clear practical applications for securing LLM-integrated systems, particularly where the model itself cannot be altered, as with commercial and closed-source LLMs. Deploying ASF inside a trusted execution environment could shield such models from adversarial manipulation even when they are integrated with untrusted front-end applications.
ASF also has limitations that suggest avenues for further work. Its focus on suffix-style attacks means it would need to be extended to other forms of prompt attack to offer comprehensive protection, and occasional false positives on benign prompts indicate room to refine the classifier's heuristics and training.
The ease of integration and the pressing need for secure NLP systems underscore the importance of this defense pipeline. Going forward, refining ASF, adapting it to evolving adversarial techniques, and extending it to broader NLP tasks will be essential to keeping it effective. The groundwork laid by Khachaturov and Mullins provides a solid foundation for these advances and highlights the ongoing challenge of safeguarding AI systems in dynamic application environments.