Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Abstract: Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors. Finding and fixing these behaviors is challenging because the attack surface is so large: it is not tractable to exhaustively search for inputs that may elicit them. Red-teaming and adversarial training (AT) are commonly used to improve robustness; however, they empirically struggle to fix failure modes that differ from the attacks used during training. In this work, we use latent adversarial training (LAT) to defend against vulnerabilities without knowledge of what they are or inputs that elicit them. LAT operates on the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction, which lets it target failure modes without examples that trigger them. Specifically, we use LAT to remove trojans and to defend against held-out classes of adversarial attacks. Across image classification, text classification, and text generation tasks, LAT usually improves both robustness to novel attacks and performance on clean data relative to AT. This suggests that LAT can be a promising tool for defending against failure modes that developers have not explicitly identified.
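To make the mechanism concrete, below is a minimal sketch of one LAT training step in PyTorch. It assumes a classifier that can be split into an `encoder` (layers up to some hidden representation) and a `head` (the remaining layers); the function name `lat_step` and the L2 budget, step size, and PGD step count are illustrative placeholders, not the paper's exact setup. The inner loop searches for a norm-bounded perturbation of the hidden activations that maximizes the loss, and the outer step then trains the model against that perturbed loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def _per_example_norm(t: torch.Tensor) -> torch.Tensor:
    """Per-example L2 norm, reshaped so it broadcasts against t."""
    return t.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, *([1] * (t.dim() - 1)))


def lat_step(encoder: nn.Module, head: nn.Module, optimizer: torch.optim.Optimizer,
             x: torch.Tensor, y: torch.Tensor,
             epsilon: float = 1.0, alpha: float = 0.25, pgd_steps: int = 8) -> float:
    """One LAT update: find an L2-bounded perturbation of the hidden activations
    that maximizes the loss, then take an ordinary gradient step on that loss."""
    # 1) Clean forward pass up to the chosen hidden layer.
    with torch.no_grad():
        h = encoder(x)

    # 2) Projected gradient ascent on a latent perturbation delta.
    delta = torch.zeros_like(h, requires_grad=True)
    for _ in range(pgd_steps):
        loss = F.cross_entropy(head(h + delta), y)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad / _per_example_norm(grad)                # ascent step
            delta *= (epsilon / _per_example_norm(delta)).clamp(max=1.0)   # project onto L2 ball

    # 3) Train the whole model (encoder and head) against the fixed perturbation.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(head(encoder(x) + delta.detach()), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```

Unlike standard AT, the perturbation here lives in activation space rather than input space, which is what allows the defense to cover attack classes whose input-space triggers were never seen during training.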