Unforgettable Generalization in Language Models (2409.02228v1)

Published 3 Sep 2024 in cs.LG and cs.CL

Abstract: When language models (LMs) are trained to forget (or "unlearn") a skill, how precisely does their behavior change? We study the behavior of transformer LMs in which tasks have been forgotten via fine-tuning on randomized labels. Such LMs learn to generate near-random predictions for individual examples in the "training" set used for forgetting. Across tasks, however, LMs exhibit extreme variability in whether LM predictions change on examples outside the training set. In some tasks (like entailment classification), forgetting generalizes robustly, and causes models to produce uninformative predictions on new task instances; in other tasks (like physical commonsense reasoning and scientific question answering) forgetting affects only the training examples, and models continue to perform the "forgotten" task accurately even for examples very similar to those that appeared in the training set. Dataset difficulty is not predictive of whether a behavior can be forgotten; instead, generalization in forgetting is (weakly) predicted by the confidence of LMs' initial task predictions and the variability of LM representations of training data, with low confidence and low variability both associated with greater generalization. Perhaps most surprisingly, random-label forgetting appears to be somewhat insensitive to the contents of the training set: for example, models trained on science questions with random labels continue to answer other science questions accurately, but begin to produce random labels on entailment classification tasks. Finally, we show that even generalizable forgetting is shallow: linear probes trained on LMs' representations can still perform tasks reliably after forgetting. Our results highlight the difficulty and unpredictability of performing targeted skill removal from models via fine-tuning.

Authors (3)
  1. Eric Zhang (12 papers)
  2. Leshem Choshen (2 papers)
  3. Jacob Andreas (116 papers)

Summary

Unforgettable Generalization in Language Models

The paper "Unforgettable Generalization in Language Models" by Eric Zhang, Leshem Choshen, and Jacob Andreas investigates how fine-tuning LLMs (LMs) on randomized labels for specific tasks affects their ability to "forget" learned capabilities. Its key contributions are a study of how forgetting generalizes beyond the forgetting set and an examination of whether such forgetting truly removes knowledge or merely alters the model's surface behavior.
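To ground the setup, the forgetting procedure itself can be sketched in a few lines. The following is a minimal illustration, not the paper's exact training configuration: the checkpoint name, learning rate, and prompt format are assumptions, and a real run would batch examples and mask prompt tokens.

```python
# Minimal sketch of random-label fine-tuning ("forgetting").
# Checkpoint, learning rate, and prompt format are illustrative assumptions.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; the paper uses LLaMA2-7B
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative lr

def forget_step(prompt: str, choices: list[str]) -> None:
    """One gradient step toward a uniformly random answer choice."""
    target = random.choice(choices)  # the true label is discarded
    batch = tok(prompt + " " + target, return_tensors="pt")
    # Causal-LM loss on the randomized target; a fuller implementation would
    # mask the prompt tokens so only the answer contributes to the loss.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

forget_step("Q: Can you boil water in a paper cup? A:", ["yes", "no"])
```

Repeating such steps over a "forget set" drives the model's predictions on those examples toward chance; the paper's central question is how far this effect spreads beyond the forget set.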

Summary of Key Findings

The authors perform an extensive set of experiments using transformer-based LMs, particularly focusing on the LLaMA2-7B model, to analyze the effects of random-label fine-tuning across a variety of tasks. They introduce two key metrics to quantify forgetting, the "forget gap" and the "forget ratio," which allow a rigorous assessment of how well models forget tasks.
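To make these metrics concrete, here is a hedged sketch assuming natural definitions: the forget gap as the drop in held-out accuracy between the pre- and post-forgetting models, and the forget ratio as that drop normalized by the maximum possible drop to chance-level accuracy (the paper's exact formulations may differ in detail).

```python
def forget_gap(acc_before: float, acc_after: float) -> float:
    """Drop in held-out task accuracy caused by random-label fine-tuning."""
    return acc_before - acc_after

def forget_ratio(acc_before: float, acc_after: float, chance: float) -> float:
    """Forget gap normalized by the largest possible drop (down to chance).
    Near 1.0: predictions fell all the way to chance (forgetting generalized).
    Near 0.0: held-out behavior is unchanged (forgetting did not generalize)."""
    return (acc_before - acc_after) / (acc_before - chance)

# e.g. a 2-way entailment task: 92% before, 55% after, 50% chance
print(forget_ratio(0.92, 0.55, 0.50))  # ~0.88: forgetting mostly generalized
```

Their primary results can be summarized as follows: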

  1. Task-Dependent Forgetting:
    • The degree of forgetting varies widely across tasks. In some, such as entailment classification, forgetting generalizes robustly: models produce uninformative predictions on new task instances.
    • In contrast, tasks requiring physical commonsense reasoning or scientific question answering exhibit limited generalization in forgetting; models trained on random labels still perform well on similar unseen examples from these tasks.
  2. Independence from Dataset Difficulty:
    • The efficacy of forgetting does not correlate with the difficulty of the task dataset: some tasks resisted forgetting regardless of how hard their datasets were, underscoring the task-specific and hard-to-predict nature of the forgetting process.
  3. Predictors of Forgetting Generalization:
    • Forgetting generalization is weakly predicted by the confidence of the LM's initial task predictions and by the variability of its representations of the training data: lower confidence and lower variability are both associated with broader forgetting.
  4. Cross-Task Forgetting:
    • There is a noticeable variability in cross-task forgetting. For example, fine-tuning on science questions with random labels left models able to answer new science questions accurately, yet caused them to begin producing random labels on entailment classification.
  5. Shallow Forgetting:
    • Even when forgetting generalizes, it appears to be shallow. Linear probes trained on the LMs' representations after forgetting can still perform the tasks reliably, indicating that the underlying knowledge is not fully erased (see the probe sketch after this list).
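To illustrate point 5, a probe can be as simple as a logistic-regression classifier fit on the forgotten model's hidden states. This sketch assumes last-token pooling over a single layer's representations; the paper's probing setup may differ.

```python
# Sketch of a linear probe over frozen LM representations. Layer index,
# last-token pooling, and the logistic-regression probe are assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def last_token_reps(model, tok, prompts, layer=-1):
    """Hidden state of the final token at a given layer, one row per prompt."""
    reps = []
    for p in prompts:
        out = model(**tok(p, return_tensors="pt"), output_hidden_states=True)
        reps.append(out.hidden_states[layer][0, -1].float().numpy())
    return np.stack(reps)

def probe_accuracy(model, tok, train_prompts, y_train, test_prompts, y_test):
    """Fit a probe on frozen representations of the forgotten model and score it.
    A score far above chance after forgetting means the forgetting was shallow."""
    X_train = last_token_reps(model, tok, train_prompts)
    X_test = last_token_reps(model, tok, test_prompts)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)
```

If such a probe scores well above chance, the task information evidently survives in the representations even though the model's own outputs have become random.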

Implications and Future Directions

These findings have profound implications for the broader objective of targeted unlearning in LMs. The variability in forgetting across different tasks and the shallowness of the forgetting observed suggest that current fine-tuning methodologies might not be sufficient for robust and reliable forgetting. This challenge points to several future research directions:

  1. Robust Unlearning Techniques:
    • The field needs more sophisticated unlearning techniques that can achieve deeper forgetting without merely suppressing surface behaviors. Approaches could involve fundamentally altering model structures or developing new training paradigms that prioritize depth in forgetting.
  2. Predictive Metrics for Forgetting:
    • Further study of predictive metrics such as model confidence and variability in data representation could lead to more effective unlearning processes. By better understanding these characteristics, researchers can tailor fine-tuning regimens to specific tasks and contexts (see the sketch of both quantities after this list).
  3. Implications for Model Safety and Ethics:
    • Effective unlearning has important implications for model safety and ethics, particularly in eliminating undesirable capabilities like generating harmful content. Robust forgetting techniques could enhance user trust and compliance with ethical standards.
  4. Cross-Model and Multi-Task Studies:
    • Extending this work to other models and exploring how task forgetting generalizes in multi-task settings can provide a more comprehensive understanding. The paper reports consistent results across models such as GPT-J-6B and GPT-2, but a broader comparison could yield additional insights.
  5. Real-World Applicability:
    • Considering practical implications, future research should test forgetting mechanisms in real-world systems to ensure that theoretical benefits translate to operational improvements.
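As a starting point for direction 2, both candidate predictors are straightforward to estimate. This sketch assumes confidence is the LM's mean top-choice probability over answer-token logits (single-token answer choices) and variability is the mean distance of training-example representations from their centroid; the paper's exact measures may differ.

```python
# Hedged estimates of the two weak predictors of forgetting generalization.
# Definitions here are assumptions, not the paper's exact formulations.
import numpy as np
import torch

@torch.no_grad()
def mean_confidence(model, tok, prompts, choice_ids):
    """Average probability mass the LM puts on its favored answer choice.
    choice_ids: token ids of the (single-token) answer options."""
    confs = []
    for p in prompts:
        logits = model(**tok(p, return_tensors="pt")).logits[0, -1]
        probs = torch.softmax(logits[choice_ids], dim=-1)
        confs.append(probs.max().item())
    return float(np.mean(confs))

def representation_variability(reps):
    """Dispersion of training-example representations around their mean;
    reps is the array returned by last_token_reps in the earlier sketch."""
    centered = reps - reps.mean(axis=0, keepdims=True)
    return float(np.linalg.norm(centered, axis=1).mean())
```

Per the paper's findings, lower values of both quantities before forgetting weakly predict that random-label fine-tuning will generalize beyond the forget set.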

Conclusion

The paper "Unforgettable Generalization in Language Models" offers a detailed examination of the challenges associated with making LMs forget specific capabilities. The nuanced results underscore the complexity of forgetting and highlight the limitations of current fine-tuning practices. While the findings reveal inconsistencies and shallow forgetting, they also open avenues for future techniques that could address the unlearning process more comprehensively and robustly. This work forms a critical foundation in the ongoing endeavor to make AI systems safer and more ethically aligned.
