Applying sparse autoencoders to unlearn knowledge in language models (2410.19278v2)

Published 25 Oct 2024 in cs.LG and cs.AI

Abstract: We investigate whether sparse autoencoders (SAEs) can be used to remove knowledge from large language models (LLMs). We use the biology subset of the Weapons of Mass Destruction Proxy (WMDP) dataset and test on the gemma-2b-it and gemma-2-2b-it LLMs. We demonstrate that individual interpretable biology-related SAE features can be used to unlearn a subset of WMDP-Bio questions with minimal side effects in domains other than biology. Our results suggest that negative scaling of feature activations is necessary and that zero-ablating features is ineffective. We find that intervening with multiple SAE features simultaneously can unlearn multiple different topics, but with similar or larger unwanted side effects than the existing Representation Misdirection for Unlearning (RMU) technique. Current SAE quality or intervention techniques would need to improve to make SAE-based unlearning comparable to existing fine-tuning-based techniques.

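The intervention described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the SparseAutoencoder below is a toy, randomly initialized module standing in for a pretrained SAE, and feature_idx, scale, and threshold are hypothetical placeholders for the paper's selected biology-related features and chosen scaling factor. Setting scale to 0 corresponds to zero ablation, which the abstract reports as ineffective, while a negative scale implements the negative-scaling intervention. The sketch only intervenes at positions where the feature fires, which is one plausible way to limit side effects; the paper's exact conditioning may differ.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy ReLU SAE standing in for a pretrained one (weights are random here)."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Feature activations: ReLU((x - b_dec) @ W_enc + b_enc)
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec


def negative_scale_feature(resid: torch.Tensor, sae: SparseAutoencoder,
                           feature_idx: int, scale: float = -20.0,
                           threshold: float = 0.0) -> torch.Tensor:
    """Replace one feature's contribution to the residual stream with a rescaled one.

    scale = 0 approximates zero ablation; scale < 0 is the negative-scaling
    intervention the abstract says is needed for unlearning. The default -20.0
    is purely illustrative.
    """
    acts = sae.encode(resid)                 # (batch, seq, d_sae)
    feat = acts[..., feature_idx]            # current activation of the target feature
    active = feat > threshold                # only intervene where the feature fires
    dec_dir = sae.W_dec[feature_idx]         # the feature's decoder direction, (d_model,)
    # Remove the original contribution (feat * dec_dir) and add the rescaled one.
    delta = ((scale - 1.0) * feat).unsqueeze(-1) * dec_dir
    return torch.where(active.unsqueeze(-1), resid + delta, resid)


if __name__ == "__main__":
    d_model, d_sae = 256, 4096
    sae = SparseAutoencoder(d_model, d_sae)
    resid = torch.randn(2, 8, d_model)       # stand-in residual-stream activations
    out = negative_scale_feature(resid, sae, feature_idx=123, scale=-20.0)
    print(out.shape)                         # torch.Size([2, 8, 256])
```

In practice such a function would be applied inside a forward hook on the model layer whose residual stream the SAE was trained on; the abstract's multi-topic variant amounts to applying the same edit for several selected features at once.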
Authors (3)
  1. Eoin Farrell (23 papers)
  2. Yeu-Tong Lau (3 papers)
  3. Arthur Conmy (22 papers)
Citations (2)
