TOFU: A Task of Fictitious Unlearning for LLMs (2401.06121v1)

Published 11 Jan 2024 in cs.LG and cs.CL

Abstract: LLMs trained on massive corpora of data from the web can memorize and reproduce sensitive or private data, raising both legal and ethical concerns. Unlearning, or tuning models to forget information present in their training data, provides us with a way to protect private data after training. Although several methods exist for such unlearning, it is unclear to what extent they result in models equivalent to those where the data to be forgotten was never learned in the first place. To address this challenge, we present TOFU, a Task of Fictitious Unlearning, as a benchmark aimed at helping deepen our understanding of unlearning. We offer a dataset of 200 diverse synthetic author profiles, each consisting of 20 question-answer pairs, and a subset of these profiles called the forget set that serves as the target for unlearning. We compile a suite of metrics that work together to provide a holistic picture of unlearning efficacy. Finally, we provide a set of baseline results from existing unlearning algorithms. Importantly, none of the baselines we consider show effective unlearning, motivating continued efforts to develop approaches for unlearning that effectively tune models so that they truly behave as if they were never trained on the forget data at all.

Overview of TOFU

The privacy implications of LLMs have become a prominent concern. Trained on wide-ranging internet data, these models can memorize and reproduce sensitive details, raising questions about data confidentiality and compliance with privacy regulations. In response, unlearning has gained traction: modifying a trained model so that it no longer retains specific data it was trained on. Several unlearning methods exist, but how well they actually work remains unclear.

Introducing the Benchmark

To address this gap, the authors introduce TOFU, a benchmark for systematically analyzing unlearning. The dataset consists of 200 fictitious author profiles, each with 20 question-answer pairs, so the information cannot already be present in a model's pretraining data. After fine-tuning a model on all profiles, the unlearning task is to remove all information about a designated subset of them, the 'forget set', with the goal of producing a model indistinguishable from one that was never trained on that data in the first place.
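
For concreteness, here is a minimal sketch of how benchmark data of this kind might be loaded. It assumes the dataset is published on the Hugging Face Hub under the identifier locuslab/TOFU with configuration names such as "full", "forget10", and "retain90"; the exact identifiers and field names should be verified against the project page.

```python
# Sketch: load TOFU-style author profiles and a forget/retain split.
# The hub id, configuration names, and field names below are assumptions;
# check the official project page for the exact values before running.
from datasets import load_dataset

full_set = load_dataset("locuslab/TOFU", "full")["train"]        # 200 authors x 20 QA pairs
forget_set = load_dataset("locuslab/TOFU", "forget10")["train"]  # ~10% of profiles to be forgotten
retain_set = load_dataset("locuslab/TOFU", "retain90")["train"]  # remaining profiles to preserve

print(len(full_set), len(forget_set), len(retain_set))           # expected: 4000, 400, 3600
print(forget_set[0]["question"], "->", forget_set[0]["answer"])
```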

Metrics for Measuring Unlearning

To gauge the success of unlearning, the authors assemble a suite of metrics along two axes: 'forget quality', which measures how close the unlearned model's behavior on the forget set is to that of a model never trained on it, and 'model utility', which measures how much general capability the model retains after unlearning. Together they capture the central trade-off: a model should forget the targeted data without losing everything else it knows.
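
To make these two axes concrete, the sketch below shows one way such metrics could be aggregated from per-example scores: answer quality is scored with ROUGE-L recall against the ground truth, model utility is summarized as a harmonic mean over held-out subsets, and forget quality is approximated by a two-sample Kolmogorov-Smirnov test comparing the unlearned model's scores on the forget set with those of a reference model retrained without it. The helper names and the exact aggregation are illustrative assumptions, not the paper's precise formulas, which also incorporate answer probabilities and truth ratios.

```python
# Sketch: aggregate "model utility" and "forget quality" from per-example scores.
# The scoring and aggregation here are simplified stand-ins for the paper's
# full metric suite.
from statistics import harmonic_mean

from rouge_score import rouge_scorer
from scipy.stats import ks_2samp

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def score_answer(prediction: str, reference: str) -> float:
    """ROUGE-L recall of a generated answer against the ground-truth answer."""
    return _scorer.score(reference, prediction)["rougeL"].recall

def model_utility(scores_by_subset: dict[str, list[float]]) -> float:
    """Harmonic mean of mean scores across held-out subsets (retain set, world facts, ...)."""
    return harmonic_mean([sum(s) / len(s) for s in scores_by_subset.values()])

def forget_quality(unlearned_scores: list[float], retrained_scores: list[float]) -> float:
    """p-value of a KS test: high values mean the unlearned model's behaviour on the
    forget set is statistically indistinguishable from a model never trained on it."""
    return ks_2samp(unlearned_scores, retrained_scores).pvalue
```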

The Unlearning Landscape

When existing unlearning baselines are evaluated under this regime, none achieves effective unlearning. Methods that do suppress the forget-set knowledge tend to degrade the model's overall utility at the same time, underscoring how difficult it is to strip targeted information without eroding general competence. One simple family of baselines, sketched below, makes this trade-off explicit by ascending the loss on the forget set while descending it on retained data.
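
The snippet below is a minimal PyTorch-style sketch of that gradient-ascent / gradient-difference idea, not the authors' exact training loop. It assumes a Hugging Face-style causal LM whose forward pass returns a loss when labels are supplied, and data loaders that yield tokenized question-answer batches.

```python
# Sketch: one epoch of gradient-ascent unlearning with a retain-set loss
# ("gradient difference" style). model, optimizer, forget_loader, and
# retain_loader are assumed to exist; batches contain input_ids,
# attention_mask, and labels for a causal language-modeling loss.
import torch

def unlearn_epoch(model, forget_loader, retain_loader, optimizer, retain_weight=1.0):
    model.train()
    for forget_batch, retain_batch in zip(forget_loader, retain_loader):
        # Ascend on the forget set by negating the usual LM loss ...
        forget_loss = model(**forget_batch).loss
        # ... while descending on the retain set to preserve utility.
        retain_loss = model(**retain_batch).loss
        loss = -forget_loss + retain_weight * retain_loss

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```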

Future Considerations

The findings from TOFU underline the pressing need for better unlearning algorithms. Current methods appear to produce only a veneer of forgetting rather than genuinely removing the underlying information. The benchmark is a call to researchers and practitioners to develop strategies that let LLMs reconcile the tension inherent in learning to forget, balancing knowledge retention against the demands of data privacy.

Authors (5)
  1. Pratyush Maini (19 papers)
  2. Zhili Feng (22 papers)
  3. Avi Schwarzschild (35 papers)
  4. J. Zico Kolter (151 papers)
  5. Zachary C. Lipton (137 papers)
Citations (85)