Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing (2402.16192v2)

Published 25 Feb 2024 in cs.CL

Abstract: Aligned LLMs are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, no existing defense provides robustness against semantic attacks while avoiding unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against the GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction-following benchmarks such as InstructionFollowing and AlpacaEval. The code will be publicly available at https://github.com/UCSB-NLP-Chang/SemanticSmooth.
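To make the aggregation idea concrete, the following Python sketch shows one way a smoothing-style defense of this kind could be wired up. It is only an illustration, not the authors' released implementation: the callable target_llm and transform_llm wrappers, the fixed transformation list, the keyword-based is_refusal judge, and the refusal majority vote are all assumptions made for the sketch.

import random

# Meaning-preserving transformations applied to the input prompt (illustrative list).
SEMANTIC_TRANSFORMS = ["Paraphrase", "Summarize", "Rewrite in formal English"]

def is_refusal(response):
    """Crude keyword check standing in for a proper safety judge (assumption)."""
    return any(k in response.lower() for k in ("i'm sorry", "i cannot", "i can't"))

def semantic_smooth(target_llm, transform_llm, prompt, n_copies=5):
    """Aggregate the target model's outputs over semantically transformed prompt copies."""
    responses = []
    for _ in range(n_copies):
        transform = random.choice(SEMANTIC_TRANSFORMS)
        # Use an auxiliary LLM to produce a semantics-preserving rewrite of the prompt.
        rewritten = transform_llm(f"{transform} the following text:\n{prompt}")
        responses.append(target_llm(rewritten))

    # Majority vote on refusal: if most transformed copies are refused, refuse;
    # otherwise return one of the compliant responses.
    refusals = [r for r in responses if is_refusal(r)]
    if len(refusals) > n_copies // 2:
        return "I'm sorry, but I can't help with that."
    return next(r for r in responses if not is_refusal(r))

The released repository linked above should be treated as the reference implementation; this sketch only illustrates the transform-then-aggregate structure described in the abstract.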

References (52)
  1. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023.
  2. Andriushchenko, M. Adversarial attacks on gpt-4 via simple random search. 2023.
  3. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  4. Defending against alignment-breaking attacks via robustly aligned llm. arXiv preprint arXiv:2309.14348, 2023.
  5. (Certified!!) Adversarial robustness for free! arXiv preprint arXiv:2206.10550, 2022.
  6. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023.
  7. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
  8. Combating misinformation in the age of llms: Opportunities and challenges. arXiv preprint arXiv:2311.05656, 2023.
  9. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  10. Certified adversarial robustness via randomized smoothing. International Conference on Machine Learning, 2019.
  11. Cyphert, A. B. A human being wrote this law review article: Gpt-3 and the practice of law. UC Davis L. Rev., 55:401, 2021.
  12. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474, 2023.
  13. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
  14. Llm self defense: By self examination, llms know they are being tricked. arXiv preprint arXiv:2308.07308, 2023.
  15. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023.
  16. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
  17. Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852, 2023.
  18. Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381, 2023.
  19. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705, 2023.
  20. Open sesame! Universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446, 2023.
  21. AlpacaEval: An Automatic Evaluator of Instruction-following Models.
  22. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023.
  23. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 15009–15018, 2023.
  24. Adversarial prompting for black box foundation models. arXiv preprint arXiv:2302.04237, 2023.
  25. Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119, 2023.
  26. Newman, J. A taxonomy of trustworthiness for artificial intelligence. CLTC: North Charleston, SC, USA, 2023.
  27. OpenAI. Gpt-4 technical report, 2023.
  28. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  29. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  30. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684, 2023.
  31. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  32. Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019.
  33. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023.
  34. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
  35. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
  36. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999.
  37. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  38. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387, 2023.
  39. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. arXiv preprint arXiv:2302.03668, 2023.
  40. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023.
  41. Randomized smoothing of all shapes and sizes. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 10693–10705. PMLR, 13-18 Jul 2020. URL https://proceedings.mlr.press/v119/yang20c.html.
  42. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. arXiv preprint arXiv:2312.02003, 2023.
  43. Safer: A structure-free approach for certified robustness to adversarial word substitutions. arXiv preprint arXiv:2005.14424, 2020.
  44. Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023a.
  45. Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023b.
  46. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.
  47. Certified robustness to text adversarial attacks by randomized [mask]. Computational Linguistics, 2023.
  48. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arXiv preprint arXiv:2401.06373, 2024.
  49. Certified robustness for large language models with self-denoising. arXiv preprint arXiv:2307.07171, 2023.
  50. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
  51. Autodan: Automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv:2310.15140, 2023.
  52. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Authors (8)
  1. Jiabao Ji (13 papers)
  2. Bairu Hou (14 papers)
  3. Alexander Robey (34 papers)
  4. George J. Pappas (208 papers)
  5. Hamed Hassani (120 papers)
  6. Yang Zhang (1129 papers)
  7. Eric Wong (47 papers)
  8. Shiyu Chang (120 papers)
Citations (28)