SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks (2310.03684v4)

Published 5 Oct 2023 in cs.LG, cs.AI, and stat.ML

Abstract: Despite efforts to align LLMs with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at \url{https://github.com/arobey1/smooth-LLM}.

Essay: SmoothLLM: Addressing Jailbreaking Vulnerabilities in LLMs

The paper "SmoothLLM: Defending LLMs Against Jailbreaking Attacks" addresses a significant vulnerability associated with LLMs such as GPT, Llama, Claude, and PaLM—namely, their susceptibility to jailbreaking attacks. These attacks allow adversaries to trick LLMs into generating inappropriate or objectionable content, despite ongoing alignment efforts with human values. The authors propose SmoothLLM, a novel defensive algorithm that effectively mitigates these vulnerabilities without introducing unnecessary conservatism and maintains efficiency, making it compatible with a broad range of LLMs.

Overview of Jailbreaking Attacks

LLMs are powerful generative models trained on massive corpora of text data. Despite efforts to align their outputs with ethical and legal standards, they are not foolproof. Jailbreaking attacks exploit these models by manipulating prompts to bypass their safety restrictions, often through adversarial prompting where specific sequences of characters induce unwanted behavior. The authors highlight adversarial attacks like those introduced by Zou et al., where carefully crafted suffixes appended to prompts can lead LLMs to generate harmful text.

Proposed Defense: SmoothLLM

SmoothLLM counters these attacks by exploiting the brittleness of adversarial prompts to character-level perturbations. The defense duplicates an input prompt, randomly perturbs each copy, and aggregates the resulting responses to detect and neutralize adversarial inputs. This reduces the attack success rate to below one percent for several state-of-the-art LLMs, including Llama2, Vicuna, and GPT-3.5. Notably, SmoothLLM requires exponentially fewer queries than the attacks it defends against, underscoring its efficiency and practicality.
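
The following sketch illustrates this perturb-and-aggregate procedure under simplifying assumptions; it is not the authors' released implementation. The `llm` callable and the `is_jailbroken` judge are hypothetical placeholders, and only the swap-style character perturbation is shown, whereas the paper also considers insert and patch perturbations.

```python
import random
import string

# Minimal sketch of a SmoothLLM-style defense: perturb several copies of the prompt
# at the character level, query the model on each copy, and aggregate by majority vote.
# `llm` and `is_jailbroken` are hypothetical stand-ins, not the paper's released code.

def random_swap(prompt: str, q: float) -> str:
    """Replace roughly a fraction q of the characters with random printable characters."""
    chars = list(prompt)
    n_swap = max(1, int(q * len(chars)))
    for idx in random.sample(range(len(chars)), n_swap):
        chars[idx] = random.choice(string.printable)
    return "".join(chars)

def smooth_llm(prompt: str, llm, is_jailbroken, n_copies: int = 10, q: float = 0.1) -> str:
    """Query the model on n_copies perturbed copies and return a response consistent with the majority vote."""
    responses = [llm(random_swap(prompt, q)) for _ in range(n_copies)]
    labels = [is_jailbroken(r) for r in responses]
    majority = sum(labels) > n_copies / 2  # True if most copies were judged jailbroken
    consistent = [r for r, jb in zip(responses, labels) if jb == majority]
    return random.choice(consistent)
```

Because an adversarial suffix typically stops working once a small fraction of its characters is changed, most perturbed copies elicit a refusal, and the majority vote then returns one of those safe responses.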

Key Contributions

  1. Comprehensive Desiderata for Defenses: The authors propose a set of criteria—attack mitigation, non-conservatism, efficiency, and compatibility—that any LLM defense should satisfy. This framework emphasizes empirical robustness, avoiding undue conservatism, maintaining efficiency, and ensuring universal applicability to various architectures and settings.
  2. Empirical and Theoretical Validation: The authors support their assertions with both empirical evaluations and theoretical guarantees. SmoothLLM demonstrates substantial reductions in attack success rates across multiple LLMs. Theoretical robustness guarantees are derived based on realistic models of perturbation stability, providing high-probability assurances of effectiveness against suffix-based attacks (a simplified illustration of this kind of guarantee is sketched after this list).
  3. Efficiency and Applicability: The paper highlights the remarkable efficiency of SmoothLLM, noting it requires far fewer queries than the attacks it defends against. The method's simplicity allows for breadth in applicability, making it ideal for deployment across diverse LLMs, including those accessible only via APIs like GPT and Claude.
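
As an illustration of the role aggregation plays in such guarantees, consider a simplified model (an expository assumption, not the paper's exact theorem): each of the N perturbed copies independently defeats the adversarial suffix with probability p. With majority-vote aggregation, the defense succeeds whenever more than half of the copies are not jailbroken:

```latex
% Illustrative majority-vote calculation under the independence assumption above;
% this is a simplified stand-in for the paper's suffix-based robustness guarantee.
\Pr[\text{defense succeeds}]
  = \Pr\!\left[\mathrm{Bin}(N, p) > \tfrac{N}{2}\right]
  = \sum_{t=\lfloor N/2 \rfloor + 1}^{N} \binom{N}{t}\, p^{t}\, (1-p)^{N-t}
```

For any p > 1/2, this probability approaches one as N grows, which is the intuition behind querying multiple perturbed copies rather than a single one.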

Experimental Analysis

Experimental results validate SmoothLLM's effectiveness. For instance, on Llama2, SmoothLLM achieves nearly a 100-fold reduction in attack success rate relative to the undefended model. The defense is also tested against adaptive attacks and maintains low attack success rates even against strategies that specifically target the smoothing approach. Moreover, SmoothLLM is evaluated on standard NLP benchmarks to confirm that it does not unduly hinder performance on benign, non-adversarial inputs.
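
For concreteness, the kind of before-and-after comparison reported in these experiments can be sketched as follows; `adversarial_prompts`, `llm`, and `is_jailbroken` are hypothetical placeholders, and `smooth_llm` refers to the sketch above.

```python
def attack_success_rate(prompts, respond, is_jailbroken) -> float:
    """Fraction of adversarial prompts whose response is judged jailbroken."""
    hits = sum(is_jailbroken(respond(p)) for p in prompts)
    return hits / len(prompts)

# Undefended model versus the smoothed defense (placeholders assumed to be defined elsewhere):
# asr_undefended = attack_success_rate(adversarial_prompts, llm, is_jailbroken)
# asr_defended = attack_success_rate(
#     adversarial_prompts,
#     lambda p: smooth_llm(p, llm, is_jailbroken, n_copies=10, q=0.1),
#     is_jailbroken,
# )
```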

Implications and Future Directions

The development of SmoothLLM marks a significant step toward robust and reliable LLM deployment. By addressing known vulnerabilities without substantially degrading model performance, SmoothLLM serves as a framework that guides future defense mechanisms. This work has practical implications for enhancing the security and reliability of AI systems, particularly as they are increasingly integrated into sensitive applications in education, healthcare, and business.

Going forward, work building on SmoothLLM could explore additional perturbation strategies or hyperparameter settings, such as the number of perturbed copies and the perturbation rate, to further improve the trade-off between robustness and nominal performance. Moreover, as adversarial attacks continue to evolve, ongoing iteration and evaluation of defenses like SmoothLLM will be crucial to maintaining the security of powerful AI systems.

In conclusion, "SmoothLLM: Defending LLMs Against Jailbreaking Attacks" presents a methodologically sound, efficient, and universal solution to a pervasive issue in modern AI—protection against jailbreaking attacks—while setting a precedent for future research and development in adversarial robustness and defense strategies for LLMs.

References (93)
  1. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
  2. Eliezer Yudkowsky. The ai alignment problem: why it is hard, and where to start. Symbolic Systems Distinguished Speaker, 4, 2016.
  3. Iason Gabriel. Artificial intelligence, values, and alignment. Minds and machines, 30(3):411–437, 2020.
  4. Brian Christian. The alignment problem: Machine learning and human values. WW Norton & Company, 2020.
  5. Regulating chatgpt and other large generative ai models. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 1112–1123, 2023.
  6. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
  7. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
  8. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335, 2023.
  9. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
  10. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023.
  11. Adversarial demonstration attacks on large language models. arXiv preprint arXiv:2305.14950, 2023.
  12. Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662, 2023.
  13. Risks of ai foundation models in education. arXiv preprint arXiv:2110.10024, 2021.
  14. Malik Sallam. Chatgpt utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. In Healthcare, volume 11, page 887. MDPI, 2023.
  15. Som Biswas. Chatgpt and the future of medical writing, 2023.
  16. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023.
  17. Adversarial prompting for black box foundation models. arXiv preprint arXiv:2302.04237, 2023.
  18. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
  19. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
  20. Certified adversarial robustness via randomized smoothing. In international conference on machine learning, pages 1310–1320. PMLR, 2019.
  21. Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019.
  22. A survey of adversarial defenses and robustness in nlp. ACM Computing Surveys, 55(14s):1–39, 2023.
  23. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994, 2020.
  24. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.
  25. Textbugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271, 2018.
  26. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
  27. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023.
  28. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705, 2023.
  29. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
  30. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  31. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  32. Generalizing to unseen domains via adversarial data augmentation. Advances in neural information processing systems, 31, 2018.
  33. Denoised smoothing: A provable defense for pretrained classifiers. Advances in Neural Information Processing Systems, 33:21945–21957, 2020.
  34. (certified!!) adversarial robustness for free! arXiv preprint arXiv:2206.10550, 2022.
  35. Cade Metz. Researchers poke holes in safety controls of chatgpt and other chatbots, Jul 2023.
  36. Will Knight. A new attack impacts chatgpt-and no one knows how to stop it, Aug 2023.
  37. Matt Burgess. Generative ai’s biggest security flaw is not easy to fix, Sep 2023.
  38. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.
  39. Jonathan Vanian. Chatgpt and generative ai are booming, but the costs can be extraordinary, Apr 2023.
  40. Zachary Champion. Optimization could cut the carbon footprint of ai training by up to 75%.
  41. Aaron Mok. Chatgpt could cost over $700,000 per day to operate. microsoft is reportedly trying to make it cheaper., Apr 2023.
  42. Sarah McQuate. Q&A: UW researcher discusses just how much energy chatgpt uses, Jul 2023.
  43. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  44. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
  45. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
  46. Provable tradeoffs in adversarially robust classification. IEEE Transactions on Information Theory, 2023.
  47. Precise tradeoffs in adversarial training for linear regression. In Conference on Learning Theory, pages 2034–2078. PMLR, 2020.
  48. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020.
  49. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
  50. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022.
  51. Perceptual adversarial robustness: Defense against unseen threat models. arXiv preprint arXiv:2006.12655, 2020.
  52. Model-based robust deep learning: Generalizing to natural, out-of-distribution data. arXiv preprint arXiv:2005.10247, 2020.
  53. Learning perturbation sets for robust machine learning. arXiv preprint arXiv:2007.08450, 2020.
  54. Breeds: Benchmarks for subpopulation shift. arXiv preprint arXiv:2008.04859, 2020.
  55. Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pages 5637–5664. PMLR, 2021.
  56. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  57. Probable domain generalization via quantile risk minimization. Advances in Neural Information Processing Systems, 35:17340–17358, 2022.
  58. Model-based domain generalization. Advances in Neural Information Processing Systems, 34:20210–20229, 2021.
  59. Do deep networks transfer invariances across classes? arXiv preprint arXiv:2203.09739, 2022.
  60. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13, pages 387–402. Springer, 2013.
  61. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  62. Efficient and accurate estimation of lipschitz constants for deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  63. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020.
  64. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pages 7472–7482. PMLR, 2019.
  65. Adversarial training should be cast as a non-zero-sum game. arXiv preprint arXiv:2306.11035, 2023.
  66. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE symposium on security and privacy (SP), pages 656–672. IEEE, 2019.
  67. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International conference on machine learning, pages 5286–5295. PMLR, 2018.
  68. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.
  69. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning, pages 10693–10705. PMLR, 2020.
  70. Probabilistically robust learning: Balancing average and worst-case performance. In International Conference on Machine Learning, pages 18667–18686. PMLR, 2022.
  71. ℓ1 adversarial robustness certificates: a randomized smoothing approach. 2019.
  72. Certified defense to image transformations via randomized smoothing. Advances in Neural information processing systems, 33:8404–8417, 2020.
  73. Certified robustness to label-flipping attacks via randomized smoothing. In International Conference on Machine Learning, pages 8230–8241. PMLR, 2020.
  74. (De)randomized smoothing for certifiable defense against patch attacks. Advances in Neural Information Processing Systems, 33:6465–6475, 2020.
  75. Certified defences against adversarial patch attacks on semantic segmentation. arXiv preprint arXiv:2209.05980, 2022.
  76. Stability guarantees for feature attributions with multiplicative smoothing. arXiv preprint arXiv:2307.05902, 2023.
  77. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp. arXiv preprint arXiv:2005.05909, 2020.
  78. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–41, 2020.
  79. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 1085–1097, 2019.
  80. Natural language adversarial attack and defense in word level. arXiv preprint arXiv:1909.06723, 2019.
  81. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998, 2018.
  82. Combating adversarial misspellings with robust word recognition. arXiv preprint arXiv:1905.11268, 2019.
  83. Adversarial training with fast gradient projection method against synonym substitution based text attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13997–14005, 2021.
  84. Natural language adversarial defense through synonym encoding. In Uncertainty in Artificial Intelligence, pages 823–833. PMLR, 2021.
  85. Defense against synonym substitution-based adversarial attacks via dirichlet neighborhood ensemble. In Association for Computational Linguistics (ACL), 2021.
  86. Adversarial robustness with semi-infinite constrained learning. Advances in Neural Information Processing Systems, 34:6198–6215, 2021.
  87. A closer look at accuracy vs. robustness. Advances in neural information processing systems, 33:8588–8601, 2020.
  88. Adversarial autoaugment. arXiv preprint arXiv:1912.11188, 2019.
  89. Maximum-entropy adversarial data augmentation for improved generalization and robustness. Advances in Neural Information Processing Systems, 33:14435–14447, 2020.
  90. Augmax: Adversarial composition of random augmentations for robust training. Advances in neural information processing systems, 34:237–250, 2021.
  91. Evaluating the adversarial robustness of adaptive test-time defenses. In International Conference on Machine Learning, pages 4421–4435. PMLR, 2022.
  92. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
  93. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.
Authors (4)
  1. Alexander Robey (34 papers)
  2. Eric Wong (47 papers)
  3. Hamed Hassani (120 papers)
  4. George J. Pappas (208 papers)
Citations (170)