UNDIAL: Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models (2402.10052v2)
Abstract: Mitigating the retention of sensitive or private information in LLMs is essential for enhancing privacy and safety. Existing unlearning methods, such as Gradient Ascent and Negative Preference Optimization, directly tune models to remove unwanted information. However, these methods fine-tune by maximizing cross-entropy loss, the reverse of standard loss minimization, which destabilizes training, especially on larger datasets: the model struggles to balance unlearning against preserving language capability, leading to over-unlearning. In this paper, we introduce UnDIAL (Unlearning via Self-Distillation on Adjusted Logits), a novel and robust unlearning method. Our approach leverages self-distillation to adjust logits and selectively reduce the influence of targeted tokens. This technique ensures smooth convergence and avoids catastrophic forgetting, even in challenging unlearning tasks with large datasets and sequential unlearning requests. Extensive experiments show that UnDIAL achieves both robust unlearning and scalability while maintaining stable training dynamics and remaining robust to hyperparameter choices.
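The abstract describes the mechanism only at a high level: a self-distillation target whose logits are adjusted to suppress the tokens being unlearned. The sketch below illustrates one plausible reading of that idea. The function name, the scalar offset gamma, the one-hot penalty, and the exact KL objective are all assumptions made for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def undial_style_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      forget_targets: torch.Tensor,
                      gamma: float = 5.0) -> torch.Tensor:
    """Self-distillation on adjusted logits (assumed form).

    student_logits / teacher_logits: (batch, seq_len, vocab)
    forget_targets: (batch, seq_len) token ids to be unlearned
    gamma: assumed scalar controlling how strongly each target
           token's logit is pushed down in the teacher distribution.
    """
    vocab = teacher_logits.size(-1)
    # Push the logit of each token marked for unlearning down by gamma;
    # all other logits are untouched, so the adjusted teacher stays
    # close to the original model everywhere else.
    penalty = F.one_hot(forget_targets, num_classes=vocab).to(teacher_logits.dtype)
    adjusted = teacher_logits - gamma * penalty
    # Distill the student toward the adjusted teacher with a KL objective.
    # This is ordinary loss minimization toward a well-formed distribution,
    # unlike Gradient Ascent, which maximizes cross-entropy.
    soft_targets = F.softmax(adjusted, dim=-1)
    log_probs = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")


if __name__ == "__main__":
    # Toy shapes: batch of 2 sequences, length 4, vocabulary of 10.
    torch.manual_seed(0)
    logits = torch.randn(2, 4, 10)
    targets = torch.randint(0, 10, (2, 4))
    # In self-distillation the teacher is a frozen snapshot of the model
    # itself, so teacher and student logits start out identical.
    print(undial_style_loss(logits.clone(), logits.detach(), targets))
```

Under this reading, the stability claim follows naturally: because the objective is minimized toward a proper probability distribution rather than maximized without bound, gradients stay bounded as the forget set grows, avoiding the divergence that gradient-ascent-style unlearning exhibits.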