
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization (2410.12949v2)

Published 16 Oct 2024 in cs.LG and cs.CL

Abstract: Methods for knowledge editing and unlearning in LLMs seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability -- which, in part, aims to identify model components (circuits) associated with specific interpretable mechanisms that make up a model capability -- can improve the precision and effectiveness of editing and unlearning. We find a stark difference in unlearning and edit robustness when training components localized by different methods. We highlight an important distinction between methods that localize components based primarily on preserving outputs, and those finding high level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the lookup-table mechanism for factual recall 1) leads to more robust edits/unlearning across different input/output formats, and 2) resists attempts to relearn the unwanted information, while also reducing unintended side effects compared to baselines, on both a sports facts dataset and the CounterFact dataset across multiple models. We also find that certain localized edits disrupt the latent knowledge in the model more than any other baselines, making unlearning more robust to various attacks.

Authors (5)
  1. Phillip Guo (5 papers)
  2. Aaquib Syed (3 papers)
  3. Abhay Sheshadri (5 papers)
  4. Aidan Ewart (5 papers)
  5. Gintare Karolina Dziugaite (54 papers)

Summary

Overview of Mechanistic Unlearning Approaches

The paper "Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization" addresses the challenge of removing or altering specific knowledge in LLMs without degrading overall language performance. The focus is on utilizing mechanistic interpretability to improve the precision and robustness of these editing and unlearning processes.

Challenges and Methodology:

The authors identify significant distinctions among existing unlearning methodologies. They compare traditional output-tracing (OT) localization, which identifies components based primarily on preserving model outputs, with mechanistic approaches that target high-level mechanisms with predictable intermediate states. In particular, the paper highlights the fact lookup (FLU) mechanism as a more effective target for robust knowledge manipulation.
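To make the OT side of this comparison concrete, the following is a minimal sketch of attribution patching in PyTorch: candidate components are scored by a first-order estimate of how much patching in their clean activation would change the output metric. This is not the authors' implementation; the choice of MLP blocks as components, the hook-based caching, and the `metric` function (e.g., a logit difference for the correct answer) are illustrative assumptions.

```python
# Hedged sketch of attribution patching for output-tracing (OT) localization.
# Assumes a Hugging Face-style causal LM whose MLP submodules have names
# ending in "mlp", and that clean/corrupt prompts tokenize to the same length.
# `clean_ids`, `corrupt_ids`, and `metric` are illustrative placeholders.
import torch

def attribution_patching_scores(model, clean_ids, corrupt_ids, metric):
    acts_clean, acts_corrupt = {}, {}

    # 1) Cache activations on the clean prompt (no gradients needed).
    handles = [
        mod.register_forward_hook(
            lambda m, inp, out, n=name: acts_clean.update({n: out.detach()}))
        for name, mod in model.named_modules() if name.endswith("mlp")
    ]
    with torch.no_grad():
        model(clean_ids)
    for h in handles:
        h.remove()

    # 2) Run the corrupted prompt, keeping activations and their gradients.
    def keep(out, n):
        out.retain_grad()          # non-leaf tensor: keep its .grad after backward
        acts_corrupt[n] = out

    handles = [
        mod.register_forward_hook(lambda m, inp, out, n=name: keep(out, n))
        for name, mod in model.named_modules() if name.endswith("mlp")
    ]
    metric(model(corrupt_ids).logits).backward()
    for h in handles:
        h.remove()

    # 3) First-order estimate of each component's effect if its clean
    #    activation were patched into the corrupted run:
    #    score ~= (a_clean - a_corrupt) . d(metric)/d(a_corrupt), summed per component.
    return {
        n: ((acts_clean[n] - acts_corrupt[n]) * acts_corrupt[n].grad).sum().item()
        for n in acts_corrupt
    }
```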

Key Findings:

  • Mechanistic and OT Localization: The research contrasts OT methods such as Causal Tracing and Attribution Patching with mechanistic approaches. The latter use manual analysis to localize components associated with factual recall, in the layers responsible for enriching the model's latent states with subject attributes.
  • Robustness and Side Effects: Experimental results indicate that targeting FLU mechanisms leads to more robust unlearning/editing and minimizes unintended retention of the targeted information. This is reflected in greater resistance to relearning and consistency across varied prompt formats, compared to non-localized or OT-localized methods.
  • Localized Fine-Tuning and Weight Masking: Evaluations on the Sports Facts and CounterFact datasets demonstrate unlearning effectiveness. Localized fine-tuning guided by mechanistic interpretability consistently showed superior robustness to adversarial attacks and prompt variations (a sketch of this procedure follows this list).
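As a rough illustration of the localized fine-tuning idea referenced above, the sketch below freezes every parameter except those belonging to the localized components and optimizes a combined edit/retain objective. The component prefixes, batch contents, and the simple additive loss are assumptions for illustration, not the paper's exact training recipe.

```python
# Hedged sketch of localized fine-tuning / weight masking: only parameters in
# the localized components (e.g. FLU-associated MLP layers) receive updates.
# `localized_prefixes`, `forget_batch`, `retain_batch`, and `alpha` are
# illustrative placeholders; batches are assumed to contain labels so the
# model returns a loss.
import torch

def localized_finetune_step(model, localized_prefixes, forget_batch,
                            retain_batch, optimizer, alpha=1.0):
    # Freeze everything, then unfreeze only the localized components.
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in localized_prefixes)

    # Edit objective: push the localized components toward the new targets
    # (for pure unlearning, this term could instead be a gradient-ascent or
    # random-label loss on the facts to be removed).
    forget_loss = model(**forget_batch).loss
    # Retain objective: keep behavior on unrelated facts unchanged.
    retain_loss = model(**retain_batch).loss

    loss = forget_loss + alpha * retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # frozen parameters have no gradients and are skipped
    return loss.item()
```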

Numerical Results:

  • Manual mechanistic localization yielded substantially more robust edits: accuracy on reformatted prompts (e.g., multiple-choice probes of the edited facts) was driven significantly lower than with OT-based localization.
  • Retain accuracy remained high, indicating preservation of general language-modeling capability; a sketch of this format-robustness evaluation follows below.
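For context on how such format-robustness numbers can be computed, here is a hedged sketch of an evaluation loop: each fact is probed both in its original completion format and as a reformatted multiple-choice question. The `generate_answer` helper and the fields of each `fact` record are hypothetical names, not the paper's evaluation code.

```python
# Hedged sketch of the robustness evaluation: accuracy is measured on the
# original prompt format and on a reformatted multiple-choice (MCQ) probe of
# the same fact. `generate_answer` and the `fact` fields are hypothetical.
def evaluate_formats(model, tokenizer, facts, generate_answer):
    hits = {"standard": 0, "mcq": 0}
    for fact in facts:
        # Original completion-style prompt, e.g. "Athlete X plays the sport of".
        if generate_answer(model, tokenizer, fact["prompt"]) == fact["answer"]:
            hits["standard"] += 1

        # Reformatted MCQ probe of the same fact; robust unlearning should
        # drive this accuracy down too, not just the original format.
        options = " ".join(f"({chr(97 + i)}) {opt}"
                           for i, opt in enumerate(fact["options"]))
        mcq_prompt = f"{fact['prompt']}\n{options}\nAnswer:"
        if generate_answer(model, tokenizer, mcq_prompt) == fact["mcq_answer"]:
            hits["mcq"] += 1

    return {fmt: count / len(facts) for fmt, count in hits.items()}
```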

Implications and Speculations:

The paper suggests that mechanistic understanding of model components allows precise unlearning while safeguarding general capabilities. By focusing on the latent information source rather than the output pathways, mechanistic unlearning demonstrates enhanced robustness.

Looking forward, the approach invites further integration of mechanistic interpretability with model editing processes, potentially leading to more secure, controllable AI systems. The framework also has potential as a benchmark for interpretability tools, helping quantify their real-world applicability and effectiveness in knowledge editing tasks.

Concluding Remarks:

This work provides a quantitative evaluation of knowledge unlearning and editing methodologies in LLMs, proposing a promising integration of mechanistic interpretability to enhance the precision and robustness of these operations. As the field progresses, such approaches could become standard practice for knowledge management in AI models, ensuring ethical, reliable deployments.