Applying sparse autoencoders to unlearn knowledge in language models (2410.19278v2)
Abstract: We investigate whether sparse autoencoders (SAEs) can be used to remove knowledge from LLMs. We use the biology subset of the Weapons of Mass Destruction Proxy (WMDP) dataset and test on the gemma-2b-it and gemma-2-2b-it LLMs. We demonstrate that individual, interpretable biology-related SAE features can be used to unlearn a subset of WMDP-Bio questions with minimal side effects in domains other than biology. Our results suggest that negative scaling of feature activations is necessary and that zero-ablating features is ineffective. We find that intervening with multiple SAE features simultaneously can unlearn multiple topics, but with similar or larger unwanted side effects than the existing Representation Misdirection for Unlearning (RMU) technique. Current SAE quality or intervention techniques would need to improve to make SAE-based unlearning comparable to existing fine-tuning-based techniques.
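The core intervention the abstract describes, negatively scaling selected SAE feature activations in the residual stream rather than zero-ablating them, can be sketched as below. This is a minimal illustration under assumptions, not the authors' code: the `encode`/`decode` interface follows SAELens-style SAE objects, and `negative_scale_features`, `feature_ids`, and the default `scale` value are hypothetical.

```python
import torch

def negative_scale_features(resid, sae, feature_ids, scale=-20.0):
    """Sketch of SAE-based unlearning via negative feature scaling.

    resid:       [batch, seq, d_model] residual-stream activations at the SAE's layer
    sae:         an SAE exposing encode()/decode() (assumed SAELens-style API)
    feature_ids: indices of the (e.g. biology-related) features to intervene on
    scale:       negative multiplier; scale = 0 corresponds to zero-ablation,
                 which the paper reports as ineffective
    """
    acts = sae.encode(resid)           # [batch, seq, d_sae] sparse feature activations
    error = resid - sae.decode(acts)   # reconstruction error, passed through untouched
    sel = acts[..., feature_ids]
    # Only scale positions where the feature actually fires; leave the rest alone.
    acts[..., feature_ids] = torch.where(sel > 0, sel * scale, sel)
    return sae.decode(acts) + error
```

In practice a function like this would run as a forward hook on the residual stream at the layer the SAE was trained on, with `feature_ids` chosen from interpretable, topic-relevant features.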
- Unlearning via RMU is mostly shallow, 2024. URL https://www.lesswrong.com/posts/6QYpXEscd8GuE7BgW/unlearning-via-rmu-is-mostly-shallow.
- Machine unlearning, 2020. URL https://arxiv.org/abs/1912.03817.
- Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- Sparse autoencoders find highly interpretable features in language models, 2023. URL https://arxiv.org/abs/2309.08600.
- Who’s Harry Potter? Approximate unlearning in LLMs, 2023. URL https://arxiv.org/abs/2310.02238.
- Scaling and evaluating sparse autoencoders, 2024. URL https://arxiv.org/abs/2406.04093.
- Gemma Team. Gemma: Open models based on gemini research and technology, 2024a. URL https://arxiv.org/abs/2403.08295.
- Gemma Team. Gemma 2: Improving open language models at a practical size, 2024b. URL https://arxiv.org/abs/2408.00118.
- Managing catastrophic misuse without robust AIs, 2024. URL https://www.alignmentforum.org/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais.
- Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, March 2023. doi: 10.1145/3571730. URL https://doi.org/10.1145/3571730.
- The WMDP benchmark: Measuring and reducing malicious use with unlearning, 2024. URL https://arxiv.org/abs/2403.03218.
- Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2, 2024. URL https://arxiv.org/abs/2408.05147.
- Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2024. URL https://arxiv.org/abs/2403.19647.
- Locating and editing factual associations in GPT, 2023. URL https://arxiv.org/abs/2202.05262.
- Pointer sentinel mixture models, 2016. URL https://arxiv.org/abs/1609.07843.
- Andrew Ng. Sparse autoencoder. CS294A Lecture Notes, 2011.
- Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.
- Improving alignment and robustness with circuit breakers, 2024. URL https://arxiv.org/abs/2406.04313.
- An adversarial perspective on machine unlearning for AI safety, 2024. URL https://arxiv.org/abs/2409.18025.
- Eoin Farrell
- Yeu-Tong Lau
- Arthur Conmy