An Adversarial Perspective on Machine Unlearning for AI Safety
The paper "An Adversarial Perspective on Machine Unlearning for AI Safety" by Jakub Łucki et al. examines how effective and reliable machine unlearning is compared with traditional safety finetuning in large language models (LLMs). The central goal of unlearning is to remove specific harmful knowledge, such as hazardous material information or offensive content, from a model's weights, thereby mitigating risk even under adversarial conditions. The paper scrutinizes the robustness claims of state-of-the-art unlearning methods, in particular Representation Misdirection for Unlearning (RMU), using a range of sophisticated adversarial techniques.
Introduction
Because LLMs are trained on vast corpora scraped from the internet, they invariably absorb unsafe or harmful knowledge. Standard safety finetuning techniques, such as Direct Preference Optimization (DPO), aim to align model behavior with ethical guidelines but remain susceptible to relatively simple adversarial attacks known as jailbreaks. Unlearning methods such as RMU propose a more definitive solution: excising the hazardous information from the model weights entirely.
Key Findings and Methodology
The empirical evaluations in this paper are rigorous and multifaceted, involving white-box adversarial methods to test the strength of unlearning defenses. Specifically, the researchers applied various adversarial techniques, including:
- Finetuning: Surprisingly, finetuning unlearned models on datasets with minimal overlap with the hazardous content effectively reverses unlearning. For example, finetuning RMU on just 10 unrelated samples recovered a majority of the hazardous capabilities, attaining 62.4% accuracy on WMDP-Bio and showcasing the vulnerability of unlearning methods to seemingly benign finetuning (see the finetuning sketch after this list).
- Orthogonalization: Directional ablation, which projects a specific direction out of the model's activations and the weights that write to them, successfully bypasses RMU's protections. Applying orthogonalization raised RMU's WMDP-Bio accuracy to 64.7%, indicating that the hazardous knowledge is not excised but merely obfuscated (a sketch of the projection follows the list).
- Logit Lens Projections: Decoding intermediate representations with the logit lens shows RMU's accuracy staying near random chance across layers, suggesting that unlearning obscures knowledge at the representation level more thoroughly than safety training does, even though the attacks above show the knowledge remains recoverable (a probing sketch appears below).
- Enhanced GCG: A variant of the Greedy Coordinate Gradient (GCG) attack adapted to unlearned models proved highly effective, recovering RMU's WMDP-Bio accuracy to 53.9%. This shows that prompt-based attacks, when tailored to unlearning, can still extract supposedly removed knowledge (the core gradient step is sketched below).
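To make the finetuning attack concrete, here is a minimal sketch of the idea: take an unlearned checkpoint and run a few steps of standard language-model finetuning on a handful of unrelated samples, then re-evaluate on WMDP-Bio. The checkpoint path, sample text, learning rate, and epoch count are illustrative placeholders, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/rmu-unlearned-model"  # hypothetical unlearned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# Ten short samples with no topical overlap with the hazardous (WMDP) content.
benign_texts = [
    "The committee will meet on Thursday to review the annual budget.",
] * 10

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
for _ in range(3):  # a few passes over the ten samples
    for text in benign_texts:
        batch = tok(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss  # standard LM loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Re-evaluating on WMDP-Bio after these few steps is where the paper reports
# a large fraction of the "unlearned" accuracy returning.
```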
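The directional-ablation ("orthogonalization") attack can be sketched as a simple projection applied to every weight matrix that writes into the residual stream. The sketch below reuses `model` from the previous snippet, assumes a LLaMA-style module layout (`model.model.layers`, `o_proj`, `down_proj`), and uses a random placeholder for the ablated direction `d`; in practice `d` would be estimated from activation statistics.

```python
import torch

def ablate_direction(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component along unit vector d (shape [d_model]) from the
    outputs of a weight matrix of shape [d_model, in_features]."""
    d = d / d.norm()
    return W - torch.outer(d, d) @ W

# Placeholder direction; the real one comes from activation differences.
d = torch.randn(model.config.hidden_size, dtype=model.dtype)

with torch.no_grad():
    for layer in model.model.layers:  # LLaMA-style module names (assumption)
        layer.self_attn.o_proj.weight.copy_(
            ablate_direction(layer.self_attn.o_proj.weight, d))
        layer.mlp.down_proj.weight.copy_(
            ablate_direction(layer.mlp.down_proj.weight, d))
```

After this edit, no attention or MLP block can write anything along `d` into the residual stream, which is why a single well-chosen direction can undo the behavioral effect of unlearning.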
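A logit-lens probe is also easy to sketch: decode each layer's hidden state for the final prompt token through the model's own final norm and unembedding head, and see at which depth (if any) the answer becomes predictable. Module names (`model.model.norm`, `model.lm_head`) again assume a LLaMA-style architecture, and the prompt is illustrative.

```python
import torch

prompt = "The following is a multiple-choice question about biology ..."  # illustrative
batch = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# Decode the last-token hidden state at every layer through the model's own
# final norm and unembedding.
for layer_idx, hidden in enumerate(out.hidden_states):
    h = model.model.norm(hidden[:, -1, :])
    logits = model.lm_head(h)
    print(f"layer {layer_idx:2d}: {tok.decode(logits.argmax(dim=-1))!r}")
```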
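Finally, the core of a GCG-style attack is a gradient-guided ranking of token substitutions in an adversarial suffix. The sketch below shows only that ranking step under the same model assumptions as above, not the paper's enhanced variant; the function name and the surrounding candidate-selection loop are assumptions, and `prompt_ids`, `suffix_ids`, and `target_ids` are 1-D token-id tensors.

```python
import torch
import torch.nn.functional as F

def gcg_substitution_scores(model, prompt_ids, suffix_ids, target_ids):
    """One GCG ranking step: gradient of the target loss w.r.t. a one-hot
    encoding of the adversarial suffix. Higher score = more promising swap."""
    embed = model.get_input_embeddings()  # nn.Embedding
    vocab_size = embed.weight.shape[0]

    one_hot = F.one_hot(suffix_ids, num_classes=vocab_size).to(embed.weight.dtype)
    one_hot.requires_grad_(True)
    suffix_embeds = one_hot @ embed.weight  # differentiable token lookup

    full_embeds = torch.cat(
        [embed(prompt_ids), suffix_embeds, embed(target_ids)], dim=0
    ).unsqueeze(0)
    logits = model(inputs_embeds=full_embeds).logits[0]

    # Loss of the target continuation given prompt + suffix.
    tgt_start = prompt_ids.shape[0] + suffix_ids.shape[0]
    loss = F.cross_entropy(logits[tgt_start - 1 : -1], target_ids)
    loss.backward()
    return -one_hot.grad  # [suffix_len, vocab_size]
```

In the full attack, the top-k scored tokens per suffix position are sampled, the resulting candidate suffixes are re-evaluated, and the best swap is kept each iteration.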
Implications and Future Directions
These findings suggest that current unlearning techniques, despite their innovative approaches, may not offer substantial advantages over traditional safety finetuning when subjected to sophisticated adversarial testing. The practical implications are significant for domains that demand stringent information safety, such as medicine or law, where misuse of specific information could have severe consequences. The research also highlights the need for evaluation methods to evolve beyond black-box assessments toward more holistic adversarial testing frameworks that thoroughly vet unlearning techniques.
Future research could focus on making unlearning algorithms resistant to rapid re-learning of hazardous knowledge, for example by combining gradient-ascent objectives with more robust representation engineering. Developing adaptive evaluations that dynamically probe and strengthen model weaknesses could likewise form the bedrock of next-generation unlearning strategies.
Conclusion
"An Adversarial Perspective on Machine Unlearning for AI Safety" provides pivotal insights into the efficacy and limitations of contemporary unlearning methods. The evaluations reveal that while methods like RMU can obscure knowledge, they often falter under adversarial scrutiny, exhibiting vulnerabilities similar to those of safety training. The work calls for more robust and adaptive assessment practices and encourages the development of unlearning strategies that more effectively prevent the extraction of hazardous knowledge from AI models.