Defending Against Unforeseen Failure Modes with Latent Adversarial Training (2403.05030v4)

Published 8 Mar 2024 in cs.CR, cs.AI, and cs.LG

Abstract: Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors. Finding and fixing these is challenging because the attack surface is so large -- it is not tractable to exhaustively search for inputs that may elicit harmful behaviors. Red-teaming and adversarial training (AT) are commonly used to improve robustness, however, they empirically struggle to fix failure modes that differ from the attacks used during training. In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without leveraging knowledge of what they are or using inputs that elicit them. LAT makes use of the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. Here, we use it to defend against failure modes without examples that elicit them. Specifically, we use LAT to remove trojans and defend against held-out classes of adversarial attacks. We show in image classification, text classification, and text generation tasks that LAT usually improves both robustness to novel attacks and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that are not explicitly identified by developers.

Exploring Latent Adversarial Training for Enhanced Model Robustness

Introduction

Ensuring the robustness and reliability of AI systems in the face of adversarial inputs remains a central challenge. Traditional approaches such as adversarial training (AT) aim to enhance model resilience but often fall short when confronted with unforeseen failure modes after deployment. In response to these limitations, the paper introduces Latent Adversarial Training (LAT), which leverages the latent representations of neural networks to fortify models against vulnerabilities without requiring explicit examples of failure-triggering inputs. The evaluation spans image classification, text classification, and text generation, showing that LAT generally surpasses conventional AT at maintaining performance on clean data while bolstering robustness against both trojans and novel classes of adversarial attacks.

Methodology

At its core, LAT diverges from traditional AT by applying adversarial perturbations in the model's latent space rather than its input space. The motivation is that latent representations are compressed, abstract encodings of the concepts a network actually uses for prediction, so perturbing them can defend against a broader spectrum of unforeseen adversarial tactics. The experiments span multiple domains: models were first fine-tuned on poisoned data to implant trojans, then further fine-tuned under LAT, AT, or random latent perturbations. These models were evaluated on clean data, under novel adversarial attacks, and in the presence of trojan triggers to assess the efficacy of LAT in comparison to existing practices.
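To make the contrast with input-space AT concrete, the sketch below shows what one LAT training step could look like in PyTorch: a projected gradient descent (PGD) perturbation is computed on the activations of an intermediate layer, and the model is then updated on the loss under that perturbation. The split of the network into `encoder` and `head`, the L2 budget `epsilon`, the step size `alpha`, and the number of PGD steps are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F


def _batch_l2(t):
    # Per-example L2 norm, reshaped so it broadcasts back over t.
    return t.flatten(1).norm(dim=1).view(-1, *([1] * (t.dim() - 1)))


def lat_step(encoder, head, x, y, optimizer, epsilon=1.0, alpha=0.25, pgd_steps=8):
    # Clean activations at the chosen latent layer (no gradient needed for this pass).
    with torch.no_grad():
        z = encoder(x)

    # Inner maximization: PGD on the latent activations within an L2 ball of radius epsilon.
    delta = torch.zeros_like(z, requires_grad=True)
    for _ in range(pgd_steps):
        loss = F.cross_entropy(head(z + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta += alpha * grad / (_batch_l2(grad) + 1e-12)
            # Project back onto the epsilon-ball.
            delta *= (epsilon / (_batch_l2(delta) + 1e-12)).clamp(max=1.0)
    delta = delta.detach()

    # Outer minimization: update the whole model on the loss under the latent perturbation.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(head(encoder(x) + delta), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```

Standard input-space AT corresponds to the same loop with the perturbation applied directly to `x`; the only structural change here is where `delta` is injected.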

Key Findings

The empirical evidence gathered in this paper yields several insights. LAT consistently enhances robustness against novel adversarial attacks and trojans without compromising performance on clean data, and in some cases improves it. This suggests that LAT serves not only as a robust defensive tactic but also as a contributor to overall model performance and reliability. Notably, these advantages were realized across varied tasks and models, reinforcing the potential of LAT as a broadly applicable strategy for AI safety. However, the choice of which latent layer to perturb proved crucial, indicating that further research into optimal layer selection could augment the utility of LAT.
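Because the layer choice is highlighted as a key hyperparameter, one hypothetical way to experiment with it is to inject the perturbation through a forward hook on a named submodule, so the split point can be swept without restructuring the model. The layer names and the `evaluate_robustness` routine below are placeholders rather than anything specified in the paper, and the sketch assumes the hooked submodule returns a single tensor.

```python
import torch


def perturb_at_layer(model, layer_name, delta):
    """Add `delta` to the output of the submodule named `layer_name` on every forward pass.

    Returns the hook handle so it can be removed after the experiment.
    """
    layer = dict(model.named_modules())[layer_name]

    def hook(_module, _inputs, output):
        # A forward hook may return a replacement output; here we shift the activations.
        return output + delta

    return layer.register_forward_hook(hook)


# Hypothetical sweep over candidate layers (names depend on the architecture).
# for name in ["layer2", "layer3", "layer4"]:
#     handle = perturb_at_layer(model, name, delta)
#     robustness = evaluate_robustness(model)  # placeholder evaluation routine
#     handle.remove()
```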

Implications and Future Directions

The introduction and validation of LAT as a viable strategy for defending against unforeseen adversarial scenarios herald a significant stride in AI safety research. By shifting the focus from input space to latent space perturbations, LAT addresses the intrinsic challenge of predicting and preparing for the myriad of potential failure modes that may not be evident during model development. This approach not only enhances the robustness of models but also underscores the complexity and multidimensionality of securing AI systems against adversarial threats.

Future investigations could delve into refining the methodologies for latent layer selection, expanding the applicability of LAT across a broader spectrum of models and domains, and exploring the intersection of LAT with other defensive mechanisms. Additionally, the exploration of targeted adversarial attacks within the latent space presents an intriguing avenue for further research, potentially offering insights into model vulnerabilities and resilience in unprecedented detail.

Conclusion

The findings of this paper present a promising avenue towards fortifying AI models against the elusive and ever-evolving landscape of adversarial threats. Latent Adversarial Training emerges not only as a technique for enhancing model robustness but also as a catalyst for further exploration in the domain of AI safety. As we venture into increasingly complex and high-stakes applications of AI, the quest for robust, reliable models becomes ever more critical, with methodologies like LAT playing a pivotal role in realizing this objective.

Authors (4)
  1. Stephen Casper
  2. Lennart Schulze
  3. Oam Patel
  4. Dylan Hadfield-Menell