Investigating Adversarial Trigger Transfer in Large Language Models (2404.16020v2)
Abstract: Recent work has developed optimization procedures to find token sequences, called adversarial triggers, which can elicit unsafe responses from aligned LLMs. These triggers are believed to be highly transferable, i.e., a trigger optimized on one model can jailbreak other models. In this paper, we concretely show that such adversarial triggers are not consistently transferable. We extensively investigate trigger transfer amongst 13 open models and observe poor and inconsistent transfer. Our experiments further reveal a significant difference in robustness to adversarial triggers between models Aligned by Preference Optimization (APO) and models Aligned by Fine-Tuning (AFT). We find that APO models are extremely hard to jailbreak even when the trigger is optimized directly on the model. On the other hand, while AFT models may appear safe on the surface, exhibiting refusals to a range of unsafe instructions, we show that they are highly susceptible to adversarial triggers. Lastly, we observe that most triggers optimized on AFT models also generalize to new unsafe instructions from five diverse domains, further emphasizing their vulnerability. Overall, our work highlights the need for more comprehensive safety evaluations for aligned LLMs.
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks, April 2024. URL http://arxiv.org/abs/2404.02151. arXiv:2404.02151 [cs, stat].
- Many-shot Jailbreaking, April 2024. URL https://www-cdn.anthropic.com/af5633c94ed2beb282f6a53c595eb437e8e7b630/Many_Shot_Jailbreaking__2024_04_02_0936.pdf.
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, April 2022. URL http://arxiv.org/abs/2204.05862. arXiv:2204.05862 [cs].
- Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment, August 2023. URL http://arxiv.org/abs/2308.09662. arXiv:2308.09662 [cs].
- Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gT5hALch9z.
- Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
- Jailbreaking Black Box Large Language Models in Twenty Queries, October 2023. URL http://arxiv.org/abs/2310.08419. arXiv:2310.08419 [cs].
- Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Deep reinforcement learning from human preferences, February 2023. URL http://arxiv.org/abs/1706.03741. arXiv:1706.03741 [cs, stat].
- Scaling Instruction-Finetuned Language Models, December 2022. URL http://arxiv.org/abs/2210.11416. arXiv:2210.11416 [cs].
- QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.
- Gemma: Open Models Based on Gemini Research and Technology, March 2024. URL http://arxiv.org/abs/2403.08295. arXiv:2403.08295 [cs].
- OLMo: Accelerating the Science of Language Models, February 2024. URL http://arxiv.org/abs/2402.00838. arXiv:2402.00838 [cs].
- The False Promise of Imitating Proprietary LLMs, May 2023. URL http://arxiv.org/abs/2305.15717. arXiv:2305.15717 [cs].
- What’s in Your ”Safe” Data?: Identifying Benign Data that Breaks Safety, April 2024. URL http://arxiv.org/abs/2404.01099. arXiv:2404.01099 [cs].
- Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14409–14428, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.806. URL https://aclanthology.org/2023.acl-long.806.
- Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=r42tSSCHPh.
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, December 2023. URL http://arxiv.org/abs/2312.06674. arXiv:2312.06674 [cs].
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models, September 2023. URL http://arxiv.org/abs/2309.00614. arXiv:2309.00614 [cs].
- Certifying LLM Safety against Adversarial Prompting, February 2024. URL http://arxiv.org/abs/2309.02705. arXiv:2309.02705 [cs].
- Open Sesame! Universal Black Box Jailbreaking of Large Language Models, November 2023. URL http://arxiv.org/abs/2309.01446. arXiv:2309.01446 [cs].
- Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=7Jwpw4qKkb.
- Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal, February 2024. URL http://arxiv.org/abs/2402.04249. arXiv:2402.04249 [cs].
- Using In-Context Learning to Improve Dialogue Safety. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 11882–11910, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.796. URL https://aclanthology.org/2023.findings-emnlp.796.
- Tree of Attacks: Jailbreaking Black-Box LLMs Automatically, February 2024. URL http://arxiv.org/abs/2312.02119. arXiv:2312.02119 [cs, stat].
- MosaicML NLP Team. Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs, 2023. URL www.mosaicml.com/blog/mpt-7b.
- GPT-4 Technical Report, March 2024. URL http://arxiv.org/abs/2303.08774. arXiv:2303.08774 [cs].
- Training language models to follow instructions with human feedback. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=TG8KACxEON.
- PyTorch: An Imperative Style, High-Performance Deep Learning Library, December 2019. URL http://arxiv.org/abs/1912.01703. arXiv:1912.01703 [cs, stat].
- LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked, October 2023. URL http://arxiv.org/abs/2308.07308. arXiv:2308.07308 [cs].
- On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7595–7609, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.472. URL https://aclanthology.org/2023.emnlp-main.472.
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=hTEGyKf0dZ.
- Universal Jailbreak Backdoors from Poisoned Human Feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=GxCGsxiAaK.
- DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, pp. 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 978-1-4503-7998-4. doi: 10.1145/3394486.3406703. URL https://doi.org/10.1145/3394486.3406703. event-place: Virtual Event, CA, USA.
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks, November 2023. URL http://arxiv.org/abs/2310.03684. arXiv:2310.03684 [cs, stat].
- On the Exploitability of Instruction Tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=4AQ4Fnemox.
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=p40XRfBX96.
- Stanford Alpaca: An Instruction-following LLaMA model, 2023. URL https://github.com/tatsu-lab/stanford_alpaca. Publication Title: GitHub repository.
- Llama 2: Open Foundation and Fine-Tuned Chat Models, July 2023. URL http://arxiv.org/abs/2307.09288. arXiv:2307.09288 [cs].
- Attention Is All You Need, August 2023. URL http://arxiv.org/abs/1706.03762. arXiv:1706.03762 [cs].
- Universal Adversarial Triggers for Attacking and Analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2153–2162, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.org/D19-1221.
- Poisoning Language Models During Instruction Tuning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 35413–35425. PMLR, July 2023. URL https://proceedings.mlr.press/v202/wan23b.html.
- OpenChat: Advancing Open-source Language Models with Mixed-Quality Data, September 2023a. URL http://arxiv.org/abs/2309.11235. arXiv:2309.11235 [cs].
- Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484–13508, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. URL https://aclanthology.org/2023.acl-long.754.
- Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations, October 2023. URL http://arxiv.org/abs/2310.06387. arXiv:2310.06387 [cs].
- HuggingFace’s Transformers: State-of-the-art Natural Language Processing, July 2020. URL http://arxiv.org/abs/1910.03771. arXiv:1910.03771 [cs].
- Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models, May 2023. URL http://arxiv.org/abs/2305.14710. arXiv:2305.14710 [cs].
- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models, October 2023. URL http://arxiv.org/abs/2310.02949. arXiv:2310.02949 [cs].
- Low-Resource Languages Jailbreak GPT-4, January 2024. URL http://arxiv.org/abs/2310.02446. arXiv:2310.02446 [cs].
- GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher, August 2023. URL http://arxiv.org/abs/2308.06463. arXiv:2308.06463 [cs].
- Removing RLHF Protections in GPT-4 via Fine-Tuning, November 2023. URL http://arxiv.org/abs/2311.05553. arXiv:2311.05553 [cs].
- Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks, January 2024. URL http://arxiv.org/abs/2401.17263. arXiv:2401.17263 [cs].
- LIMA: Less Is More for Alignment. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=KBMOKmX2he.
- Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF, November 2023. URL https://starling.cs.berkeley.edu.
- Fine-Tuning Language Models from Human Preferences, January 2020. URL http://arxiv.org/abs/1909.08593. arXiv:1909.08593 [cs, stat].
- Universal and Transferable Adversarial Attacks on Aligned Language Models, July 2023. URL http://arxiv.org/abs/2307.15043. arXiv:2307.15043 [cs].