Rethinking Machine Unlearning for Large Language Models (2402.08787v6)

Published 13 Feb 2024 in cs.LG and cs.CL

Abstract: We explore machine unlearning (MU) in the domain of LLMs, referred to as LLM unlearning. This initiative aims to eliminate undesirable data influence (e.g., sensitive or illegal information) and the associated model capabilities, while maintaining the integrity of essential knowledge generation and not affecting causally unrelated information. We envision LLM unlearning becoming a pivotal element in the life-cycle management of LLMs, potentially standing as an essential foundation for developing generative AI that is not only safe, secure, and trustworthy, but also resource-efficient without the need of full retraining. We navigate the unlearning landscape in LLMs from conceptual formulation, methodologies, metrics, and applications. In particular, we highlight the often-overlooked aspects of existing LLM unlearning research, e.g., unlearning scope, data-model interaction, and multifaceted efficacy assessment. We also draw connections between LLM unlearning and related areas such as model editing, influence functions, model explanation, adversarial training, and reinforcement learning. Furthermore, we outline an effective assessment framework for LLM unlearning and explore its applications in copyright and privacy safeguards and sociotechnical harm reduction.

Rethinking Machine Unlearning for LLMs

The paper "Rethinking Machine Unlearning for LLMs" offers a comprehensive examination of the emerging field of unlearning within the context of LLMs. This work is pivotal for the life-cycle management of LLMs, as it aims to remove undesirable data influence, which may contain sensitive or illegal information, while maintaining the integrity of essential knowledge and ensuring model efficiency without the need for full retraining.

Key Contributions and Insights

The paper outlines several key contributions to the field:

  1. Conceptual Formulation:
    • The authors define LLM unlearning as the process of efficiently and effectively eliminating the influence of specific 'unlearning targets' and associated model capabilities while preserving performance for non-targets.
    • This process involves identifying the specific data subsets and/or knowledge concepts to unlearn, while accounting for the intertwined data-model interactions that form the crux of influence erasure; a common mathematical formulation from the unlearning literature is sketched after this list.
  2. Unlearning Methods:
    • Several approaches to LLM unlearning are discussed, with emphasis on model-based methods such as gradient ascent and its variants, localization-informed unlearning, and influence function-based methods.
    • The paper also explores input-based approaches, though it suggests these may be weaker than model-based methods because data influence is difficult to eliminate through input modifications alone; a minimal gradient-difference training step is also sketched below the list.
  3. Evaluation Framework:
    • For effective assessment, the authors advocate evaluating the model on both in-scope and out-of-scope examples, reporting efficiency metrics (computation and memory costs), and comparing against retraining from scratch as the gold standard, among other criteria.
    • They highlight the importance of stress-testing robustness with hard in-scope examples and of setting rigorous criteria for genuine unlearning; a skeleton of such an evaluation loop is sketched after this list.
  4. Applications:
    • Applications of LLM unlearning extend to copyright and privacy safeguards and to reducing sociotechnical harms such as toxic content generation, complementing broader AI alignment efforts.
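
To ground the formulation in item 1, the sketch below gives a regularized objective that is common in the unlearning literature; it is illustrative and not necessarily the paper's exact notation. D_f denotes the forget set (the unlearning targets) and D_r the retain set (causally unrelated data).

```latex
% Illustrative regularized unlearning objective (common in the literature; the
% paper's notation may differ). D_f: forget set (unlearning targets); D_r: retain set.
\theta_{\mathrm{u}}
  = \arg\min_{\theta}\;
    \underbrace{-\,\mathbb{E}_{(x,y)\sim \mathcal{D}_f}\!\left[\ell(y \mid x;\,\theta)\right]}_{\text{forget: raise loss on targets}}
    \;+\;
    \lambda\,\underbrace{\mathbb{E}_{(x,y)\sim \mathcal{D}_r}\!\left[\ell(y \mid x;\,\theta)\right]}_{\text{retain: preserve utility}}
```

Here the loss is the model's usual token-level objective and lambda trades off forgetting strength against utility preservation; gradient-ascent methods optimize this objective directly, while retraining on D_r alone serves as the exact but expensive gold standard.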
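As a concrete illustration of the model-based methods in item 2, here is a minimal PyTorch-style sketch of one gradient-ascent / gradient-difference update. The `model`, `optimizer`, and tokenized `forget_batch` / `retain_batch` dictionaries (which include labels) are assumed placeholders, not artifacts from the paper.

```python
import torch

def unlearning_step(model, optimizer, forget_batch, retain_batch, lam=1.0):
    """One gradient-difference update: ascend on the forget set, descend on the retain set.

    Illustrative sketch only; `model` is any Hugging Face-style causal LM whose
    forward pass returns an object with a `.loss` attribute when labels are given.
    """
    model.train()
    optimizer.zero_grad()

    # Loss on the unlearning targets: negated so that minimizing it performs gradient ascent.
    forget_loss = model(**forget_batch).loss
    # Loss on causally unrelated (retain) data: kept positive to preserve utility.
    retain_loss = model(**retain_batch).loss

    total = -forget_loss + lam * retain_loss
    total.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # keep updates bounded
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```

In practice, the variants surveyed in the paper temper the ascent term (e.g., by regularizing toward the original model) to avoid catastrophic loss of general capabilities.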
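Finally, as a sketch of the in-scope versus out-of-scope evaluation advocated in item 3, the snippet below compares perplexity on forget-set and retain-set prompts. The tokenizer, model, and prompt lists are hypothetical, and a full assessment would also report efficiency metrics and compare against a retrained reference model.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts, device="cpu"):
    """Average per-text perplexity; higher values on in-scope examples suggest successful forgetting."""
    model.eval()
    ppls = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt").to(device)
        loss = model(**enc, labels=enc["input_ids"]).loss
        ppls.append(math.exp(loss.item()))
    return sum(ppls) / len(ppls)

# Hypothetical usage: effective unlearning should raise in-scope perplexity
# (the model no longer reproduces the target content) while leaving out-of-scope
# perplexity close to the pre-unlearning baseline.
# in_scope_ppl  = perplexity(unlearned_model, tokenizer, forget_prompts)
# out_scope_ppl = perplexity(unlearned_model, tokenizer, retain_prompts)
```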

Theoretical and Practical Implications

The implications of this research are significant in both theoretical and practical domains:

  • Theoretical Impacts:
    • The proposed methodologies advance the understanding of data-model interactions and localized influence within LLMs, providing a foundation for further research on influence erasure in AI models.
    • The emphasis on adversarial training as a part of unlearning methodologies could lead to more robust AI models resistant to adversarial attacks.
    • Enhancing the unlearning paradigm to be more authentic and precise may lead to the development of more trustworthy and safe AI systems.
  • Practical Impacts:
    • This research can directly influence AI policy, particularly in contexts requiring legal compliance such as the 'right to be forgotten' and algorithmic disgorgement.
    • The application of unlearning techniques can mitigate risks of privacy leakage and reduce harmful outputs, providing more aligned and secure AI services.
    • Adopting localization-informed techniques might offer computational efficiency, making unlearning feasible for large-scale models deployed in real-world scenarios; a minimal sketch of this idea follows below.
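
To illustrate the efficiency point in the last bullet, the sketch below restricts gradient updates to a small set of modules suspected of storing the targeted knowledge. The keyword-based selection is a hypothetical stand-in for real localization analyses (e.g., knowledge neurons or causal tracing), not the paper's procedure.

```python
import torch

def restrict_updates_to_localized_modules(model, target_keywords=("mlp", "ffn")):
    """Freeze every parameter except those in modules suspected of storing the target knowledge.

    Illustrative only: actual localization methods identify the relevant components
    empirically; the keyword match below is a placeholder for that step.
    """
    trainable = []
    for name, param in model.named_parameters():
        if any(key in name.lower() for key in target_keywords):
            param.requires_grad = True
            trainable.append(name)
        else:
            param.requires_grad = False
    return trainable  # only these parameters are passed to the optimizer

# Hypothetical usage: build the optimizer over the unfrozen subset, then run the
# same gradient-difference updates as sketched earlier at a fraction of the memory cost.
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```

Because only a small fraction of parameters remains trainable, gradient and optimizer-state memory shrink accordingly, which is the main source of the claimed efficiency.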

Speculation on Future Developments

Looking forward, several future developments can be anticipated in AI unlearning research:

  • Refinement in localization-informed unlearning methods, facilitating more precise and efficient influence removal.
  • In-depth exploration of adversarial unlearning to guard against sophisticated jailbreak attacks and adversarial prompts.
  • The creation of standardized, detailed benchmarks and datasets to evaluate unlearning processes consistently across different domains and applications.

Conclusion

The paper "Rethinking Machine Unlearning for LLMs" provides foundational insights and suggests rigorous formulations for the emerging field of LLM unlearning. The proposed approaches highlight the importance of balancing efficient unlearning with the retention of critical model capabilities. The discussions around applications and the intricacies of evaluation metrics serve as a pivotal guide for future research trajectories and practical deployments, fostering the development of safe, secure, and reliable AI systems.

Authors (14)
  1. Sijia Liu (204 papers)
  2. Yuanshun Yao (28 papers)
  3. Jinghan Jia (30 papers)
  4. Stephen Casper (40 papers)
  5. Nathalie Baracaldo (34 papers)
  6. Peter Hase (29 papers)
  7. Xiaojun Xu (30 papers)
  8. Yuguang Yao (24 papers)
  9. Hang Li (277 papers)
  10. Kush R. Varshney (121 papers)
  11. Mohit Bansal (304 papers)
  12. Sanmi Koyejo (111 papers)
  13. Yang Liu (2253 papers)
  14. Chris Yuhao Liu (9 papers)
Citations (49)