Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM (2309.14348v3)
Abstract: Large language models (LLMs) have recently made significant advances and are now widely used across many domains. Unfortunately, there is growing concern that LLMs can be misused to generate harmful or malicious content. Although a line of research has focused on aligning LLMs with human values and preventing them from producing inappropriate content, such alignments are usually fragile and can be bypassed by alignment-breaking attacks using adversarially optimized or handcrafted jailbreaking prompts. In this work, we introduce a Robustly Aligned LLM (RA-LLM) to defend against potential alignment-breaking attacks. RA-LLM can be constructed directly on top of an existing aligned LLM by adding a robust alignment checking function, without any expensive retraining or fine-tuning of the original LLM. We also provide a theoretical analysis of RA-LLM to verify its effectiveness in defending against alignment-breaking attacks. Through real-world experiments on open-source LLMs, we demonstrate that RA-LLM successfully defends against both state-of-the-art adversarial prompts and popular handcrafted jailbreaking prompts, reducing their attack success rates from nearly 100% to around 10% or less.
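To make the idea of a "robust alignment checking function" more concrete, below is a minimal Python sketch of how such a wrapper around an existing aligned LLM might look. It assumes (as an illustration, not as the paper's exact procedure) that the check randomly drops a fraction of the prompt's tokens, queries the aligned model several times, and rejects the request if the model refuses too often on the perturbed copies. The helpers `aligned_llm` and `is_refusal`, and all parameter values, are hypothetical placeholders.

```python
import random

# Illustrative refusal markers; a real system would use a stronger detector.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(response: str) -> bool:
    """Crude keyword check for an alignment refusal (assumed helper)."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)


def robust_alignment_check(prompt: str,
                           aligned_llm,
                           drop_ratio: float = 0.3,
                           n_trials: int = 20,
                           threshold: float = 0.2) -> bool:
    """Return True if the prompt passes this illustrative alignment check.

    Each trial drops every token with probability `drop_ratio`, asks the
    aligned LLM to respond to the perturbed prompt, and records whether the
    response is a refusal. If refusals occur on more than `threshold` of the
    trials, the prompt is treated as malicious.
    """
    tokens = prompt.split()
    refusals = 0
    for _ in range(n_trials):
        kept = [t for t in tokens if random.random() > drop_ratio]
        perturbed = " ".join(kept) if kept else prompt
        if is_refusal(aligned_llm(perturbed)):
            refusals += 1
    return refusals / n_trials <= threshold


def ra_llm(prompt: str, aligned_llm) -> str:
    """Wrap an existing aligned LLM with the robust alignment check."""
    if robust_alignment_check(prompt, aligned_llm):
        return aligned_llm(prompt)
    return "Request rejected: it failed the robust alignment check."
```

Note that this wrapper only calls the underlying aligned model at inference time, which mirrors the abstract's claim that no retraining or fine-tuning of the original LLM is required.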
Authors:
- Bochuan Cao
- Yuanpu Cao
- Lu Lin
- Jinghui Chen