Digital Forgetting in Large Language Models: A Survey of Unlearning Methods (2404.02062v1)
Abstract: The objective of digital forgetting is, given a model with undesirable knowledge or behavior, to obtain a new model in which the detected issues are no longer present. Motivations for forgetting include privacy protection, copyright protection, elimination of biases and discrimination, and prevention of harmful content generation. Digital forgetting has to be effective (the new model must actually have forgotten the undesired knowledge/behavior), must retain the original model's performance on desirable tasks, and must be scalable (in particular, forgetting has to be more efficient than retraining from scratch on only the tasks/data to be retained). This survey focuses on forgetting in LLMs. We first provide background on LLMs, including their components, the types of LLMs, and their usual training pipeline. Second, we describe the motivations, types, and desired properties of digital forgetting. Third, we introduce the approaches to digital forgetting in LLMs, among which unlearning methodologies stand out as the state of the art. Fourth, we provide a detailed taxonomy of machine unlearning methods for LLMs, and we survey and compare current approaches. Fifth, we detail the datasets, models, and metrics used to evaluate forgetting, retention, and runtime. Sixth, we discuss challenges in the area. Finally, we provide some concluding remarks.
Authors: Alberto Blanco-Justicia, Najeeb Jebreel, Benet Manzanares, Josep Domingo-Ferrer, Guillem Collell, Kuan Eeik Tan, David Sánchez