RESTOR: Knowledge Recovery in Machine Unlearning (2411.00204v3)
Abstract: LLMs trained on web-scale corpora can memorize undesirable data containing misinformation, copyrighted material, or private or sensitive information. Recently, several machine unlearning algorithms have been proposed to eliminate the effect of such datapoints from trained models -- that is, to approximate a model that had never been trained on these datapoints in the first place. However, evaluating the effectiveness of unlearning algorithms remains an open challenge. Previous work has relied on heuristics -- such as verifying that the model can no longer reproduce the specific information targeted for removal while maintaining accuracy on unrelated test data. These approaches do not adequately capture the full effect of reversing the influence of datapoints on a trained model. In this work, we propose the RESTOR framework for machine unlearning evaluation, which assesses unlearning algorithms for targeted data erasure by evaluating whether models both forget the knowledge introduced by these datapoints and recover the knowledge state they would have had if they had never encountered these datapoints. RESTOR helps uncover several novel insights about popular unlearning algorithms and the mechanisms through which they operate -- for instance, identifying that some algorithms merely emphasize forgetting but not recovering knowledge, and that localizing unlearning targets can enhance unlearning performance.
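To make the evaluation idea concrete, the sketch below shows one way such a recovery-oriented comparison could be scored; this is a minimal illustration assuming simple factual QA accuracy as the metric and hypothetical helper names (`accuracy_on_facts`, `restor_scores`, and the three `*_fn` model callables), not the paper's actual implementation. The point it captures: unlearning is judged not only by whether the model stops producing the corrupted content, but by how closely its accuracy on the affected facts returns to that of a reference model that never saw the corrupting datapoints.

```python
# Hypothetical sketch of a recovery-oriented unlearning evaluation.
# Three knowledge states are compared on the facts touched by corruption:
#   clean_fn      - reference model never trained on the corrupting datapoints
#   corrupted_fn  - model after training on the corrupting datapoints
#   unlearned_fn  - corrupted model after an unlearning algorithm is applied
# Each *_fn maps a question string to the model's answer string.

from typing import Callable, Dict, List

Fact = Dict[str, str]  # e.g. {"question": "...", "answer": "..."}


def accuracy_on_facts(answer_fn: Callable[[str], str], facts: List[Fact]) -> float:
    """Fraction of fact questions whose gold answer appears in the model's output."""
    correct = sum(
        fact["answer"].lower() in answer_fn(fact["question"]).lower()
        for fact in facts
    )
    return correct / len(facts)


def restor_scores(clean_fn: Callable[[str], str],
                  corrupted_fn: Callable[[str], str],
                  unlearned_fn: Callable[[str], str],
                  affected_facts: List[Fact]) -> Dict[str, float]:
    """Score knowledge recovery on the facts affected by the corrupting datapoints."""
    clean_acc = accuracy_on_facts(clean_fn, affected_facts)          # upper bound
    corrupted_acc = accuracy_on_facts(corrupted_fn, affected_facts)  # after corruption
    unlearned_acc = accuracy_on_facts(unlearned_fn, affected_facts)  # after unlearning
    return {
        "corruption_damage": clean_acc - corrupted_acc,   # accuracy lost to corruption
        "recovery": unlearned_acc - corrupted_acc,        # accuracy restored by unlearning
        "gap_to_clean": clean_acc - unlearned_acc,        # 0.0 means full restoration
    }
```

Under this framing, an algorithm that only suppresses the targeted text but leaves the model's factual accuracy degraded would show little recovery despite appearing to "forget" successfully.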