
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning (2404.05868v2)

Published 8 Apr 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: LLMs often memorize sensitive, private, or copyrighted data during pre-training. LLM unlearning aims to eliminate the influence of undesirable data from the pre-trained model while preserving the model's utilities on other tasks. Several practical methods have recently been proposed for LLM unlearning, mostly based on gradient ascent (GA) on the loss of undesirable data. However, on certain unlearning tasks, these methods either fail to effectively unlearn the target data or suffer from catastrophic collapse -- a drastic degradation of the model's utilities. In this paper, we propose Negative Preference Optimization (NPO), a simple alignment-inspired method that could efficiently and effectively unlearn a target dataset. We theoretically show that the progression toward catastrophic collapse by minimizing the NPO loss is exponentially slower than GA. Through experiments on synthetic data and the benchmark TOFU dataset, we demonstrate that NPO-based methods achieve a better balance between unlearning the undesirable data and maintaining the model's utilities. We also observe that NPO-based methods generate more sensible outputs than GA-based methods, whose outputs are often gibberish. Remarkably, on TOFU, NPO-based methods are the first to achieve reasonable unlearning results in forgetting 50% (or more) of the training data, whereas existing methods already struggle with forgetting 10% of training data.

Negative Preference Optimization: A New Approach to LLM Unlearning

Introduction to Machine Unlearning in LLMs

The advent of LLMs has been paralleled by growing concerns about their ability to recall and reproduce sensitive or copyrighted training data. This has driven the development of efficient unlearning methods that remove the influence of specific data subsets ("forget sets") without retraining the model from scratch, which is computationally prohibitive. Existing methods, mostly based on gradient ascent (GA) on the loss over the forget set, have shown limited success: they often either fail to unlearn the target data effectively or trigger catastrophic collapse, a drastic degradation of the model's utility.
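
For context, GA-based unlearning literally ascends the training loss on the forget set, i.e. it minimizes the negated cross-entropy. A minimal PyTorch-style sketch, assuming a Hugging Face-style causal LM whose forward pass returns the mean token cross-entropy as `loss` (the function name here is illustrative):

```python
import torch

def ga_unlearning_loss(model, input_ids, labels):
    """Gradient-ascent unlearning objective on a forget-set batch.

    Standard training minimizes the cross-entropy; GA unlearning flips
    the sign, so minimizing this quantity *raises* the loss on the
    forget data. The objective is unbounded below, which is the root
    of the catastrophic-collapse behaviour discussed above.
    """
    outputs = model(input_ids=input_ids, labels=labels)
    return -outputs.loss
```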

Addressing the Limitations of Gradient Ascent

To address these limitations, this paper introduces Negative Preference Optimization (NPO), which draws on preference optimization methods but uses only negative samples, with no positive counterparts. Through theoretical analysis and experiments on synthetic data and the TOFU benchmark, NPO is shown to outperform GA, mitigating catastrophic collapse and striking a better balance between forget quality and model utility.

Negative Preference Optimization (NPO) Explained

NPO reframes unlearning as a preference optimization problem in which the forget-set samples act as dispreferred responses, without positive counterparts. It replaces the unbounded GA loss with a bounded one, yielding slower divergence and more stable training dynamics. The paper's theoretical analysis shows that progression toward catastrophic collapse is exponentially slower under NPO than under GA, which helps explain its effectiveness.
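
Concretely, the NPO loss stated in the paper is (2/β) · E over the forget set of log(1 + (π_θ(y|x)/π_ref(y|x))^β), where π_ref is the frozen pre-unlearning model and β > 0 an inverse-temperature hyperparameter; it recovers the GA loss in the limit β → 0. A minimal PyTorch sketch, assuming sequence-level log-probabilities have already been computed under both models:

```python
import torch
import torch.nn.functional as F

def npo_loss(logp_theta, logp_ref, beta=0.1):
    """NPO loss on a batch of forget-set sequences.

    logp_theta: log pi_theta(y|x) per sequence under the current model
    logp_ref:   log pi_ref(y|x) under the frozen pre-unlearning reference

    (2/beta) * log(1 + (pi_theta/pi_ref)^beta)
        = (2/beta) * softplus(beta * (logp_theta - logp_ref)),
    so the loss is bounded below by zero, unlike the GA loss.
    """
    return (2.0 / beta) * F.softplus(beta * (logp_theta - logp_ref)).mean()
```

Because the softplus saturates at zero once log π_θ falls well below log π_ref, per-sample gradients shrink as unlearning progresses rather than blowing up, which is the mechanism behind the slower divergence.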

Advancements and Contributions

The paper's experimental validations reveal that:

  • NPO provides a better trade-off between forgetting and retaining information compared to existing methods.
  • It achieves reasonable unlearning results when forgetting large fractions of the training data (50% or more), whereas prior methods already struggle at 10%.
  • Incorporating a retain-loss term into the NPO framework further improves performance, balancing the unlearning of targeted data against the preservation of general model utility (a sketch of this combination follows the list).
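
The retain term can be sketched as a standard next-token loss on held-out retain data added to the NPO term. The weighting `lam` below is an illustrative assumption; the paper's retain-regularized variant adds a retain loss, but the exact coefficient should be taken from the paper itself:

```python
import torch.nn.functional as F

def npo_with_retain_loss(logp_theta_forget, logp_ref_forget, retain_nll,
                         beta=0.1, lam=1.0):
    """Composite objective: NPO on the forget set plus a standard NLL on
    a retain set to anchor general utility. `lam` trades forgetting
    strength against utility preservation (its value here is
    illustrative, not the paper's)."""
    forget_term = (2.0 / beta) * F.softplus(
        beta * (logp_theta_forget - logp_ref_forget)).mean()
    return forget_term + lam * retain_nll
```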

Implications and Future Directions

NPO not only marks a significant step forward in the practical application of unlearning to LLMs but also opens new pathways for research. In particular, generalizing its principles to broader challenges in AI beyond unlearning is an intriguing prospect. Its success on much larger forget sets suggests the method could extend to more complex or higher-stakes scenarios, including adversarial inputs or settings that require finer-grained unlearning.

Concluding Remarks

In summary, Negative Preference Optimization offers a promising avenue for effective unlearning in LLMs. By applying preference optimization with negative examples alone, it circumvents the pitfalls of gradient ascent and sets a new benchmark for the efficiency and effectiveness of machine unlearning. The scalability and adaptability of NPO suggest fertile ground for further innovation in the rapidly evolving field of generative AI.

Authors (4)
  1. Ruiqi Zhang
  2. Licong Lin
  3. Yu Bai
  4. Song Mei