Do Unlearning Methods Remove Information from Language Model Weights? (2410.08827v2)

Published 11 Oct 2024 in cs.LG

Abstract: LLMs' knowledge of how to perform cyber-security attacks, create bioweapons, and manipulate humans poses risks of misuse. Previous work has proposed methods to unlearn this knowledge. Historically, it has been unclear whether unlearning techniques are removing information from the model weights or just making it harder to access. To disentangle these two objectives, we propose an adversarial evaluation method to test for the removal of information from model weights: we give an attacker access to some facts that were supposed to be removed, and using those, the attacker tries to recover other facts from the same distribution that cannot be guessed from the accessible facts. We show that using fine-tuning on the accessible facts can recover 88% of the pre-unlearning accuracy when applied to current unlearning methods, revealing the limitations of these methods in removing information from the model weights.

Evaluation of Unlearning Methods in LLMs

The paper "Do Unlearning Methods Remove Information from LLM Weights?" tackles a critical issue in the domain of LLMs: the potential misuse of information embedded in model weights, such as knowledge of cyber-security attacks or bioweapons. Current unlearning techniques claim to mitigate these risks by either removing sensitive knowledge or making it harder to access. This paper assesses whether these unlearning techniques effectively remove information from model weights or merely obscure it.

Main Contributions

The authors propose an adversarial evaluation framework to test the efficacy of unlearning methods. The framework attempts to recover information that was supposed to have been removed from the model weights: given access to some of the facts targeted by unlearning, an attacker fine-tunes the model on those accessible facts and measures how much pre-unlearning accuracy is recovered on held-out facts. The finding that 88% of the original accuracy can be reclaimed in this way underscores the limitations of current unlearning strategies such as RMU, Gradient Difference, and Random Incorrect Answer (RIA).
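
To make the headline number concrete, here is a minimal sketch of how such a recovery rate could be computed; the function name and the example accuracies are illustrative assumptions, not values or code from the paper.

```python
def recovery_rate(acc_pre_unlearn: float, acc_post_attack: float) -> float:
    """Fraction of pre-unlearning accuracy restored by the fine-tuning attack.

    Values near 1.0 mean that fine-tuning on the accessible facts brought
    back almost all of the knowledge the unlearning method was meant to
    remove from the weights.
    """
    return acc_post_attack / acc_pre_unlearn

# Illustrative numbers: unlearning drops held-out accuracy to near chance,
# but fine-tuning on accessible facts pushes it back to 0.66 out of an
# original 0.75, i.e. 88% of the pre-unlearning accuracy is recovered.
print(recovery_rate(acc_pre_unlearn=0.75, acc_post_attack=0.66))  # ≈ 0.88
```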

Methodology

The paper rigorously evaluates the efficacy of unlearning methods using a series of experiments across multiple datasets, each designed to minimize shared information among facts:

  1. Years Dataset: Historical events and their occurrence years.
  2. MMLU Subsets: Diverse categories with minimal overlap in knowledge.
  3. WMDP-Deduped: A refined version of the WMDP dataset, intentionally filtered to reduce information overlap.
  4. Random Birthdays: Randomly generated names and birth years, ensuring independence between facts (a generation sketch follows this list).
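
As an illustration of this independence property, the sketch below shows how a Random-Birthdays-style dataset could be generated; the name pools, year range, and question format are illustrative assumptions rather than the paper's exact construction.

```python
import random

# Illustrative name pools (not from the paper).
FIRST_NAMES = ["Alice", "Bassam", "Chen", "Dmitri", "Esha", "Farid"]
LAST_NAMES = ["Okafor", "Silva", "Tanaka", "Novak", "Haddad", "Iversen"]

def make_random_birthdays(n: int, seed: int = 0) -> list[dict]:
    """Generate n independent (name, birth year) facts.

    Because names and years are sampled at random, knowing some of the
    facts gives an attacker no way to guess the others, so any accuracy
    recovered after fine-tuning must come from information that is still
    stored in the weights.
    """
    rng = random.Random(seed)
    facts = []
    for _ in range(n):
        name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
        year = rng.randint(1900, 2000)
        facts.append({"question": f"When was {name} born?",
                      "answer": str(year)})
    return facts

# Split into facts the attacker can see (T) and held-out facts (V)
# used to measure how much supposedly removed knowledge comes back.
facts = make_random_birthdays(1000)
t_facts, v_facts = facts[:800], facts[800:]
```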

The researchers perform unlearning with each technique and then assess residual knowledge using their Retraining on T (RTT) procedure: the unlearned model is fine-tuned on an accessible subset T of the supposedly removed facts, and accuracy is then measured on held-out facts that cannot be guessed from T.
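
As a hedged sketch of what such an attack could look like in practice (not the paper's actual code), the snippet below fine-tunes an already-unlearned Hugging Face causal LM on the accessible facts T and then measures how much accuracy returns on the held-out facts V; the model name, hyperparameters, and fact format are placeholder assumptions.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

def rtt_attack(unlearned_model_name: str, t_facts, v_facts,
               epochs: int = 3, lr: float = 2e-5) -> float:
    """Fine-tune on accessible facts T, then evaluate on held-out facts V."""
    tok = AutoTokenizer.from_pretrained(unlearned_model_name)
    model = AutoModelForCausalLM.from_pretrained(unlearned_model_name)
    opt = AdamW(model.parameters(), lr=lr)

    # Retraining on T: standard causal-LM fine-tuning on the facts the
    # attacker is allowed to see.
    model.train()
    for _ in range(epochs):
        for fact in t_facts:
            text = f"{fact['question']} {fact['answer']}"
            batch = tok(text, return_tensors="pt")
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

    # Evaluation on V: how many supposedly unlearned facts resurface?
    model.eval()
    correct = 0
    for fact in v_facts:
        inputs = tok(fact["question"], return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=8)
        completion = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
        correct += fact["answer"] in completion
    return correct / len(v_facts)
```

Comparing the post-attack accuracy on V against the model's pre-unlearning accuracy on the same facts gives the recovery rate discussed above.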

Results and Discussion

The evaluation reveals that current unlearning techniques primarily hide rather than remove information: RTT recovers a substantial portion of pre-unlearning accuracy. Even in configurations where the unlearned model initially appears to have forgotten most of the sensitive knowledge, RTT restores much of it, exposing the limitations of these methods.

The implications are significant for AI safety and governance: merely obstructing access to sensitive information is not sufficient. The ease with which fine-tuning on related facts resurfaces obscured knowledge points to the need for more robust unlearning strategies that genuinely remove the underlying information.

Recommendations

For future research in AI safety:

  • Clearly differentiate whether proposed methods aim to remove knowledge or merely limit its accessibility.
  • Include adversarial evaluation, such as RTT, to assess the robustness of unlearning techniques.
  • Openly release models and code for peer evaluation, fostering reproducibility and collaborative improvement in unlearning methods.

Conclusion

The authors conclude with a call to develop unlearning techniques that offer stronger guarantees of knowledge removal, rather than relying on current methods, which obscure rather than eliminate sensitive information. This research lays a foundation for building safer AI systems with robust mechanisms for handling potentially hazardous embedded knowledge.

Authors
  1. Aghyad Deeb
  2. Fabien Roger