Investigating Adversarial Trigger Transfer in Large Language Models (2404.16020v2)

Published 24 Apr 2024 in cs.CL

Abstract: Recent work has developed optimization procedures to find token sequences, called adversarial triggers, which can elicit unsafe responses from aligned LLMs. These triggers are believed to be highly transferable, i.e., a trigger optimized on one model can jailbreak other models. In this paper, we concretely show that such adversarial triggers are not consistently transferable. We extensively investigate trigger transfer amongst 13 open models and observe poor and inconsistent transfer. Our experiments further reveal a significant difference in robustness to adversarial triggers between models Aligned by Preference Optimization (APO) and models Aligned by Fine-Tuning (AFT). We find that APO models are extremely hard to jailbreak even when the trigger is optimized directly on the model. On the other hand, while AFT models may appear safe on the surface, exhibiting refusals to a range of unsafe instructions, we show that they are highly susceptible to adversarial triggers. Lastly, we observe that most triggers optimized on AFT models also generalize to new unsafe instructions from five diverse domains, further emphasizing their vulnerability. Overall, our work highlights the need for more comprehensive safety evaluations for aligned LLMs.


Summary

  • The paper demonstrates that triggers optimized with the Greedy Coordinate Gradient method do not reliably transfer across different LLMs.
  • Models aligned by preference optimization (APO), such as Llama2 and Starling, exhibit strong resistance to adversarial triggers, while models aligned by fine-tuning (AFT) are notably vulnerable.
  • Generalization experiments reveal that triggers optimized on AFT models also succeed on new unsafe instructions, underscoring the need for rigorous safety evaluations.

This paper investigates the transferability of adversarial triggers for jailbreaking aligned LLMs. The authors empirically demonstrate that adversarial triggers generated using the Greedy Coordinate Gradient (GCG) method do not reliably transfer across different models, particularly those aligned by preference optimization.

The paper highlights the following key observations:

  1. Inconsistent Transferability: The paper challenges the notion of universal transferability of adversarial triggers. Through experiments on 13 open models, the authors find that triggers optimized on one model often fail to jailbreak other models.
  2. Robustness of Aligned by Preference Optimization (APO) Models: Models aligned using preference optimization techniques, such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), exhibit significant robustness against adversarial triggers: they are hard to jailbreak even when triggers are optimized directly on them, and such triggers transfer poorly to other models. Examples of APO models used in the paper include Gemma, Llama2, and Starling.
  3. Vulnerability of Aligned by Fine-Tuning (AFT) Models: The research indicates that models aligned by fine-tuning (AFT) are more susceptible to adversarial triggers. Despite appearing safe on the surface by refusing unsafe instructions, AFT models can be easily jailbroken with adversarial triggers. Examples of AFT models used in the paper include Koala, Vicuna, and MPT-7B-Chat.
  4. Generalization to New Unsafe Instructions: Triggers optimized on AFT models demonstrate a capacity to generalize to new, unseen unsafe instructions across diverse domains, further emphasizing the vulnerability of these models.
  5. Experimental Setup: The paper uses examples from AdvBench for trigger optimization and evaluation. The Attack Success Rate (ASR) is the proportion of instructions to which a model responds harmfully, and ΔASR is the difference between the ASRs obtained with and without the trigger appended to the input (a minimal computation sketch follows this list). Llama-Guard is used to detect whether triggers jailbreak models.
  6. Greedy Coordinate Gradient (GCG): The paper uses GCG, a white-box attack method, to find adversarial triggers. GCG iteratively updates the trigger using gradient information so as to minimize the cross-entropy loss of affirmative responses to harmful instructions (a simplified GCG step is sketched after this list).
  7. Models Used: The paper includes a range of open models, including Gemma (Instruct-2B and Instruct-7B), Guanaco (7B and 13B), Llama2 (7B-Chat and 13B-Chat), MPT-7B-Chat, OpenChat-3.5-7B, Starling-7B (α and β), Vicuna (7B and 13B), and Koala-7B.
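
As a rough illustration of the metrics in item 5, the sketch below computes ASR and ΔASR from per-instruction jailbreak judgments. The helper names (generate, judge_is_harmful, e.g. a Llama-Guard-based judge) are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of ASR and ΔASR computation (assumed helper interfaces, not the paper's code).
from typing import Callable, List

def attack_success_rate(
    instructions: List[str],
    generate: Callable[[str], str],                 # model wrapper: prompt -> response (assumed)
    judge_is_harmful: Callable[[str, str], bool],   # e.g. a Llama-Guard-based judge (assumed)
    trigger: str = "",
) -> float:
    """Fraction of instructions for which the model produces a harmful response."""
    hits = 0
    for instruction in instructions:
        prompt = f"{instruction} {trigger}".strip()  # the trigger is appended to the instruction
        response = generate(prompt)
        if judge_is_harmful(instruction, response):
            hits += 1
    return hits / len(instructions)

def delta_asr(instructions, generate, judge_is_harmful, trigger) -> float:
    """ΔASR: ASR with the trigger appended minus ASR without it."""
    with_trigger = attack_success_rate(instructions, generate, judge_is_harmful, trigger)
    without_trigger = attack_success_rate(instructions, generate, judge_is_harmful)
    return with_trigger - without_trigger
```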

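The GCG step in item 6 can be summarized as: take the gradient of the target loss with respect to one-hot trigger tokens, collect the top-k promising substitutions per position, evaluate a batch of single-token swaps, and keep the best. The sketch below is a single-instruction simplification under assumed HuggingFace-style interfaces, not the paper's code; embed_matrix is assumed to be the model's input embedding weight matrix.

```python
# Simplified single-instruction GCG step (assumed HuggingFace-style interfaces; a sketch of
# the published algorithm, not the paper's implementation).
import torch
import torch.nn.functional as F

@torch.no_grad()
def candidate_losses(model, prefix_ids, cand_trigger_ids, target_ids):
    """Target cross-entropy for each candidate trigger, in one batched forward pass."""
    n = cand_trigger_ids.shape[0]
    input_ids = torch.cat(
        [prefix_ids.repeat(n, 1), cand_trigger_ids, target_ids.repeat(n, 1)], dim=1
    )
    logits = model(input_ids).logits
    tgt_len = target_ids.shape[0]
    # Loss is measured only on the affirmative target ("Sure, here is ...") positions.
    tgt_logits = logits[:, -tgt_len - 1:-1, :]
    return F.cross_entropy(
        tgt_logits.transpose(1, 2), target_ids.repeat(n, 1), reduction="none"
    ).mean(dim=1)

def gcg_step(model, embed_matrix, prefix_ids, trigger_ids, target_ids,
             top_k=256, num_candidates=128):
    """One Greedy Coordinate Gradient step: gradient-guided token proposals, then a greedy pick.

    embed_matrix is assumed to be the model's input embedding weight (vocab_size x hidden);
    prefix_ids, trigger_ids, target_ids are 1-D LongTensors on the model's device.
    """
    # 1) One-hot trigger tokens so the loss is differentiable w.r.t. token choices.
    one_hot = torch.zeros(trigger_ids.shape[0], embed_matrix.shape[0],
                          device=embed_matrix.device, dtype=embed_matrix.dtype)
    one_hot.scatter_(1, trigger_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    # 2) Cross-entropy of the affirmative target given [prefix, trigger, target].
    embeds = torch.cat([embed_matrix[prefix_ids], one_hot @ embed_matrix,
                        embed_matrix[target_ids]], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=embeds).logits[0]
    loss = F.cross_entropy(logits[-target_ids.shape[0] - 1:-1], target_ids)
    loss.backward()

    # 3) Top-k promising substitutions per trigger position (largest negative gradient).
    top_tokens = (-one_hot.grad).topk(top_k, dim=1).indices

    # 4) Sample candidates that each swap one randomly chosen position, then keep the
    #    candidate with the lowest target loss (the greedy step).
    device = trigger_ids.device
    cands = trigger_ids.repeat(num_candidates, 1)
    pos = torch.randint(0, trigger_ids.shape[0], (num_candidates,), device=device)
    pick = torch.randint(0, top_k, (num_candidates,), device=device)
    cands[torch.arange(num_candidates, device=device), pos] = top_tokens[pos, pick]

    losses = candidate_losses(model, prefix_ids, cands, target_ids)
    return cands[losses.argmin()]
```
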
The authors conduct several experiments to assess trigger transferability (a schematic transfer-evaluation loop is sketched after this list):

  • Transfer from Existing Ensembles: The paper evaluates the transferability of triggers optimized using model ensembles from a previous work. The results show inconsistent transfer across models, with APO models being particularly resistant to jailbreaking.
  • Transfer Among APO Models: Triggers optimized on APO models are tested for transferability to other APO models. The paper finds limited transferability, even within the same model family.
  • Alignment by Fine-Tuning Analysis: The paper investigates the robustness of AFT models against adversarial triggers. Trigger optimization converges faster on AFT models, and they are more susceptible to jailbreaking than APO models.
  • Instruction Generalization: The paper examines the generalization of triggers optimized on AdvBench to other safety benchmarks. The results indicate that triggers can generalize well to unseen unsafe instructions, particularly for AFT models.
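
These cross-model transfer experiments reduce to a simple grid: optimize a trigger on a source model, then measure the ΔASR it induces on every target model. A schematic loop, reusing the hypothetical delta_asr helper sketched earlier, might look like this (optimize_trigger and make_generate are placeholder names, not the paper's API):

```python
# Schematic cross-model transfer grid (placeholder helpers; not the paper's code).
from typing import Callable, Dict, List, Tuple

def transfer_matrix(
    source_models: Dict[str, object],
    target_models: Dict[str, object],
    instructions: List[str],
    optimize_trigger: Callable,   # e.g. repeated GCG steps on the source model (placeholder)
    make_generate: Callable,      # wraps a target model as prompt -> response (placeholder)
    judge_is_harmful: Callable,   # e.g. a Llama-Guard-based judge (placeholder)
) -> Dict[Tuple[str, str], float]:
    """ΔASR of a trigger optimized on each source model, evaluated on each target model."""
    results = {}
    for src_name, src_model in source_models.items():
        trigger = optimize_trigger(src_model, instructions)
        for tgt_name, tgt_model in target_models.items():
            generate = make_generate(tgt_model)
            results[(src_name, tgt_name)] = delta_asr(
                instructions, generate, judge_is_harmful, trigger
            )
    return results
```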

The paper concludes that while AFT models may appear safe due to their ability to refuse unsafe instructions, they lack the adversarial robustness of APO models. The authors suggest that more comprehensive safety evaluations, including automated red-teaming, are needed to assess model robustness.

The paper acknowledges limitations, including the focus on a single attack method (GCG), the lack of evaluation of response relevance and helpfulness, and the potential for triggers to not fully converge within the allotted optimization time.

In summary, this paper provides empirical evidence against the universal transferability of adversarial triggers, highlighting the robustness of APO models and the vulnerability of AFT models. The findings underscore the need for more rigorous safety evaluations and the development of more robust alignment techniques.
