Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
38 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM (2309.14348v3)

Published 18 Sep 2023 in cs.CL, cs.AI, cs.CR, and cs.LG

Abstract: Recently, LLMs have made significant advancements and are now widely used across various domains. Unfortunately, there has been a rising concern that LLMs can be misused to generate harmful or malicious content. Though a line of research has focused on aligning LLMs with human values and preventing them from producing inappropriate content, such alignments are usually vulnerable and can be bypassed by alignment-breaking attacks via adversarially optimized or handcrafted jailbreaking prompts. In this work, we introduce a Robustly Aligned LLM (RA-LLM) to defend against potential alignment-breaking attacks. RA-LLM can be directly constructed upon an existing aligned LLM with a robust alignment checking function, without requiring any expensive retraining or fine-tuning process of the original LLM. Furthermore, we also provide a theoretical analysis for RA-LLM to verify its effectiveness in defending against alignment-breaking attacks. Through real-world experiments on open-source LLMs, we demonstrate that RA-LLM can successfully defend against both state-of-the-art adversarial prompts and popular handcrafted jailbreaking prompts by reducing their attack success rates from nearly 100% to around 10% or less.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (50)
  1. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998, 2018.
  2. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
  3. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  4. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  5. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  6. Certified adversarial robustness via randomized smoothing. In international conference on machine learning, pp.  1310–1320. PMLR, 2019.
  7. Towards robustness against natural language word substitutions. arXiv preprint arXiv:2107.13541, 2021.
  8. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751, 2017.
  9. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pp.  50–56. IEEE, 2018.
  10. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023.
  11. Julian Hazell. Large language models can be used to effectively scale spear phishing campaigns. arXiv preprint arXiv:2305.06972, 2023.
  12. Llm self defense: By self examination, llms know they are being tricked. arXiv preprint arXiv:2308.07308, 2023.
  13. Baseline defenses for adversarial attacks against aligned language models, 2023.
  14. Certified robustness to adversarial word substitutions. arXiv preprint arXiv:1909.00986, 2019.
  15. Is bert really robust? a strong baseline for natural language attack on text classification and entailment. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  8018–8025, 2020.
  16. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733, 2023.
  17. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.
  18. Pretraining language models with human preferences. In International Conference on Machine Learning, pp.  17506–17533. PMLR, 2023.
  19. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705, 2023.
  20. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197, 2023.
  21. Textbugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271, 2018.
  22. Generating natural language attacks in a hard label black box setting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  13525–13533, 2021.
  23. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.
  24. Ha-Thanh Nguyen. A brief report on lawgpt 1.0: A virtual legal assistant based on gpt-3. arXiv preprint arXiv:2302.05729, 2023.
  25. Ms marco: A human-generated machine reading comprehension dataset. 2016.
  26. OpenAI. Gpt-4 technical report, 2023.
  27. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  28. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th annual meeting of the association for computational linguistics, pp.  1085–1097, 2019.
  29. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023.
  30. ” do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
  31. Robustness verification for transformers. arXiv preprint arXiv:2002.06622, 2020.
  32. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  33. Large language models in medicine. Nature medicine, pp.  1–11, 2023.
  34. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  35. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  36. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
  37. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  38. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082, 2023.
  39. Bloomberggpt: A large language model for finance, 2023.
  40. Adversarial attacks and defenses in images, graphs and text: A review. International Journal of Automation and Computing, 17:151–178, 2020a.
  41. Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079, 2020b.
  42. Automatic perturbation analysis for scalable certified robustness and beyond. Advances in Neural Information Processing Systems, 33:1129–1141, 2020c.
  43. Leapattack: Hard-label adversarial attack on text via gradient-based optimization. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp.  2307–2315, 2022.
  44. Pat: Geometry-aware hard-label black-box adversarial attacks on text. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp.  3093–3104, 2023.
  45. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463, 2023.
  46. Certified robustness to text adversarial attacks by randomized [mask]. Computational Linguistics, 49(2):395–427, 2023.
  47. Defense against synonym substitution-based adversarial attacks via dirichlet neighborhood ensemble. In Association for Computational Linguistics (ACL), 2021.
  48. Freelb: Enhanced adversarial training for natural language understanding. arXiv preprint arXiv:1909.11764, 2019.
  49. Autodan: Automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv:2310.15140, 2023.
  50. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Bochuan Cao (16 papers)
  2. Yuanpu Cao (11 papers)
  3. Lu Lin (54 papers)
  4. Jinghui Chen (50 papers)
Citations (101)
X Twitter Logo Streamline Icon: https://streamlinehq.com
Reddit Logo Streamline Icon: https://streamlinehq.com