Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing (2402.16192v2)

Published 25 Feb 2024 in cs.CL

Abstract: Aligned LLMs are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, no existing defense provides robustness against semantic attacks while avoiding unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against the GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction-following benchmarks such as InstructionFollowing and AlpacaEval. The code will be publicly available at https://github.com/UCSB-NLP-Chang/SemanticSmooth.
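To make the aggregation idea concrete, the following Python sketch shows one way a smoothing-style defense of this kind could be wired up. It is only an illustration, not the authors' released implementation: the callable target_llm and transform_llm wrappers, the fixed transformation list, the keyword-based is_refusal judge, and the refusal majority vote are all assumptions made for the sketch.

import random

# Meaning-preserving transformations applied to the input prompt (illustrative list).
SEMANTIC_TRANSFORMS = ["Paraphrase", "Summarize", "Rewrite in formal English"]

def is_refusal(response):
    """Crude keyword check standing in for a proper safety judge (assumption)."""
    return any(k in response.lower() for k in ("i'm sorry", "i cannot", "i can't"))

def semantic_smooth(target_llm, transform_llm, prompt, n_copies=5):
    """Aggregate the target model's outputs over semantically transformed prompt copies."""
    responses = []
    for _ in range(n_copies):
        transform = random.choice(SEMANTIC_TRANSFORMS)
        # Use an auxiliary LLM to produce a semantics-preserving rewrite of the prompt.
        rewritten = transform_llm(f"{transform} the following text:\n{prompt}")
        responses.append(target_llm(rewritten))

    # Majority vote on refusal: if most transformed copies are refused, refuse;
    # otherwise return one of the compliant responses.
    refusals = [r for r in responses if is_refusal(r)]
    if len(refusals) > n_copies // 2:
        return "I'm sorry, but I can't help with that."
    return next(r for r in responses if not is_refusal(r))

The released repository linked above should be treated as the reference implementation; this sketch only illustrates the transform-then-aggregate structure described in the abstract.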

References (52)
  1. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023.
  2. Andriushchenko, M. Adversarial attacks on gpt-4 via simple random search. 2023.
  3. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  4. Defending against alignment-breaking attacks via robustly aligned llm. arXiv preprint arXiv:2309.14348, 2023.
  5. (Certified!!) Adversarial robustness for free! arXiv preprint arXiv:2206.10550, 2022.
  6. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023.
  7. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
  8. Combating misinformation in the age of llms: Opportunities and challenges. arXiv preprint arXiv:2311.05656, 2023.
  9. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  10. Certified adversarial robustness via randomized smoothing. International Conference on Machine Learning, 2019.
  11. Cyphert, A. B. A human being wrote this law review article: Gpt-3 and the practice of law. UC Davis L. Rev., 55:401, 2021.
  12. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474, 2023.
  13. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
  14. Llm self defense: By self examination, llms know they are being tricked. arXiv preprint arXiv:2308.07308, 2023.
  15. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023.
  16. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
  17. Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852, 2023.
  18. Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381, 2023.
  19. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705, 2023.
  20. Open sesame! Universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446, 2023.
  21. AlpacaEval: An Automatic Evaluator of Instruction-following Models.
  22. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023.
  23. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 15009–15018, 2023.
  24. Adversarial prompting for black box foundation models. arXiv preprint arXiv:2302.04237, 2023.
  25. Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119, 2023.
  26. Newman, J. A taxonomy of trustworthiness for artificial intelligence. CLTC: North Charleston, SC, USA, 2023.
  27. OpenAI. Gpt-4 technical report, 2023.
  28. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  29. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  30. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684, 2023.
  31. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  32. Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019.
  33. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023.
  34. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
  35. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
  36. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999.
  37. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  38. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387, 2023.
  39. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. arXiv preprint arXiv:2302.03668, 2023.
  40. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023.
  41. Randomized smoothing of all shapes and sizes. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 10693–10705. PMLR, 13-18 Jul 2020. URL https://proceedings.mlr.press/v119/yang20c.html.
  42. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. arXiv preprint arXiv:2312.02003, 2023.
  43. Safer: A structure-free approach for certified robustness to adversarial word substitutions. arXiv preprint arXiv:2005.14424, 2020.
  44. Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023a.
  45. Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023b.
  46. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.
  47. Certified robustness to text adversarial attacks by randomized [mask]. Computational Linguistics, 2023.
  48. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arXiv preprint arXiv:2401.06373, 2024.
  49. Certified robustness for large language models with self-denoising. arXiv preprint arXiv:2307.07171, 2023.
  50. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
  51. Autodan: Automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv:2310.15140, 2023.
  52. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Authors (8)
  1. Jiabao Ji (13 papers)
  2. Bairu Hou (14 papers)
  3. Alexander Robey (34 papers)
  4. George J. Pappas (208 papers)
  5. Hamed Hassani (120 papers)
  6. Yang Zhang (1129 papers)
  7. Eric Wong (47 papers)
  8. Shiyu Chang (120 papers)
Citations (28)