Advancing the Robustness of Large Language Models through Self-Denoised Smoothing (2404.12274v1)

Published 18 Apr 2024 in cs.CL and cs.AI

Abstract: Although LLMs have achieved significant success, their vulnerability to adversarial perturbations, including recent jailbreak attacks, has raised considerable concern. However, the increasing size of these models and their limited access make improving their robustness a challenging task. Among various defense strategies, randomized smoothing has shown great potential for LLMs, as it does not require full access to the model's parameters or fine-tuning via adversarial training. However, randomized smoothing involves adding noise to the input before model prediction, and the final model's robustness largely depends on the model's performance on these noise-corrupted data; its effectiveness is therefore often limited by the model's sub-optimal performance on noisy inputs. To address this issue, we propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then make predictions based on these denoised versions. We call this procedure self-denoised smoothing. Unlike previous denoised-smoothing techniques in computer vision, which require training a separate denoising model, our method offers significantly better efficiency and flexibility. Our experimental results indicate that our method surpasses existing methods in both empirical and certified robustness when defending against adversarial attacks on both downstream tasks and human alignment (i.e., jailbreak attacks). Our code is publicly available at https://github.com/UCSB-NLP-Chang/SelfDenoise
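For illustration only, the sketch below shows one way the self-denoised smoothing loop described in the abstract could be organized: randomly perturb the input, let the LLM itself reconstruct each perturbed copy, classify the reconstructions, and aggregate by majority vote. The `denoise_fn` and `classify_fn` callables and the word-masking noise are assumptions made for this sketch, not the paper's exact prompts or certification procedure.

```python
import random
from collections import Counter

def self_denoised_smoothing(text, classify_fn, denoise_fn,
                            mask_rate=0.3, num_samples=10, mask_token="<mask>"):
    """Minimal sketch of self-denoised smoothing (hypothetical helper names).

    1. Perturb the input by randomly masking a fraction of its words.
    2. Ask the LLM itself to reconstruct (denoise) each perturbed copy.
    3. Classify each denoised copy and return the majority-vote label.
    """
    words = text.split()
    votes = []
    for _ in range(num_samples):
        # Step 1: random word masking serves as the smoothing noise here (an assumption).
        noisy = [mask_token if random.random() < mask_rate else w for w in words]
        noisy_text = " ".join(noisy)

        # Step 2: the same LLM fills in the masked words (self-denoising).
        denoised_text = denoise_fn(noisy_text)

        # Step 3: predict on the denoised input.
        votes.append(classify_fn(denoised_text))

    # The smoothed prediction is the majority vote over the noisy samples.
    return Counter(votes).most_common(1)[0][0]
```

In this setup, `denoise_fn` and `classify_fn` would both be prompts to the same underlying LLM, which is what distinguishes self-denoised smoothing from denoised smoothing in vision, where a separate denoiser must be trained.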

Authors (10)
  1. Jiabao Ji (13 papers)
  2. Bairu Hou (14 papers)
  3. Zhen Zhang (384 papers)
  4. Guanhua Zhang (24 papers)
  5. Wenqi Fan (78 papers)
  6. Qing Li (430 papers)
  7. Yang Zhang (1129 papers)
  8. Gaowen Liu (60 papers)
  9. Sijia Liu (204 papers)
  10. Shiyu Chang (120 papers)
Citations (3)