Machine Unlearning in Generative AI: A Survey (2407.20516v1)
Abstract: Generative AI technologies, such as (multimodal) large language models (LLMs) and vision generative models, have been deployed in many settings. Their remarkable performance is largely attributable to massive training data and emergent reasoning abilities. However, these models can memorize and generate sensitive, biased, or dangerous information originating from the training data, especially data obtained through web crawling. New machine unlearning (MU) techniques are being developed to reduce or eliminate undesirable knowledge and its effects from the models, because techniques designed for traditional classification tasks cannot be directly applied to generative AI. This article offers a comprehensive survey of MU in generative AI, covering a new problem formulation, evaluation methods, and a structured discussion of the advantages and limitations of different kinds of MU techniques. It also presents several critical challenges and promising directions in MU research. A curated list of readings can be found at: https://github.com/franciscoliu/GenAI-MU-Reading.