
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs (2407.15549v2)

Published 22 Jul 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of 'jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.
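The abstract describes targeted LAT as an inner/outer optimization: an adversary perturbs the model's latent activations to minimize loss on a specific competing (harmful) task, and the model is then updated to behave desirably under that perturbation. The sketch below illustrates one such training step in PyTorch against a Hugging Face-style causal LM. It is a minimal sketch under stated assumptions, not the paper's exact recipe: the attacked layer, the L2 norm bound, the single broadcast perturbation vector per example, the batch keys (`harmful_ids`, `safe_ids`, etc.), and all hyperparameters are illustrative choices.

```python
# Minimal sketch of one targeted-LAT training step (PyTorch, Hugging Face-style
# causal LM). Layer choice, epsilon, step counts, and the L2 projection are
# illustrative assumptions, not the paper's exact configuration.
import torch


def targeted_lat_step(model, attacked_layer, batch,
                      epsilon=8.0, inner_steps=6, inner_lr=0.5):
    """Inner loop: an adversary perturbs latent activations at `attacked_layer`
    to minimize loss on a harmful target completion (the competing task).
    Outer step: with that perturbation frozen, the model is trained to minimize
    loss on the desirable (e.g. refusal) completion instead. Assumes harmful
    and safe examples are paired per batch element."""
    device = batch["harmful_ids"].device
    hidden_size = model.config.hidden_size

    # One perturbation vector per example, broadcast across sequence positions.
    delta = torch.zeros(batch["harmful_ids"].size(0), 1, hidden_size,
                        device=device, requires_grad=True)

    def add_perturbation(_module, _inputs, output):
        # A forward hook may return a modified output; add delta to the layer's
        # hidden states (first element if the layer returns a tuple).
        if isinstance(output, tuple):
            return (output[0] + delta,) + output[1:]
        return output + delta

    handle = attacked_layer.register_forward_hook(add_perturbation)
    try:
        # --- Adversary: steer latents toward the harmful completion. ---
        for _ in range(inner_steps):
            adv_loss = model(input_ids=batch["harmful_ids"],
                             labels=batch["harmful_labels"]).loss
            (grad,) = torch.autograd.grad(adv_loss, delta)
            with torch.no_grad():
                delta -= inner_lr * grad
                # Project back onto an L2 ball of radius epsilon.
                norm = delta.norm(dim=-1, keepdim=True).clamp_min(1e-8)
                delta *= (epsilon / norm).clamp(max=1.0)

        # --- Defender: behave desirably under the frozen perturbation. ---
        delta = delta.detach()
        safe_loss = model(input_ids=batch["safe_ids"],
                          labels=batch["safe_labels"]).loss
        safe_loss.backward()  # caller applies optimizer.step() afterwards
    finally:
        handle.remove()
    return safe_loss.item()
```

A full pipeline would typically wrap this step with additional objectives that preserve general capabilities on benign data, and the same inner-loop machinery can be pointed at different competing tasks, which is how the abstract's jailbreak-robustness, backdoor-removal, and unlearning experiments differ; those extra terms are omitted here for brevity.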

Authors (11)
  1. Abhay Sheshadri (5 papers)
  2. Aidan Ewart (5 papers)
  3. Phillip Guo (5 papers)
  4. Aengus Lynch (8 papers)
  5. Cindy Wu (3 papers)
  6. Vivek Hebbar (2 papers)
  7. Henry Sleight (10 papers)
  8. Asa Cooper Stickland (15 papers)
  9. Ethan Perez (55 papers)
  10. Dylan Hadfield-Menell (54 papers)
  11. Stephen Casper (40 papers)
Citations (14)