Eight Methods to Evaluate Robust Unlearning in LLMs (2402.16835v1)

Published 26 Feb 2024 in cs.CL

Abstract: Machine unlearning can be useful for removing harmful capabilities and memorized text from LLMs, but there are not yet standardized methods for rigorously evaluating it. In this paper, we first survey techniques and limitations of existing unlearning evaluations. Second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning in the "Who's Harry Potter" (WHP) model from Eldan and Russinovich (2023). While WHP's unlearning generalizes well when evaluated with the "Familiarity" metric from Eldan and Russinovich, we find i) higher-than-baseline amounts of knowledge can reliably be extracted, ii) WHP performs on par with the original model on Harry Potter Q&A tasks, iii) it represents latent knowledge comparably to the original model, and iv) there is collateral unlearning in related domains. Overall, our results highlight the importance of comprehensive unlearning evaluation that avoids ad-hoc metrics.

Comprehensive Evaluation of Unlearning Techniques in LLMs

Introduction to Unlearning in LLMs

LLMs have become central to advancing AI capabilities, offering unprecedented opportunities for natural language understanding and generation. However, their ability to retain and potentially reveal sensitive information has raised significant concerns regarding privacy, copyright, and the propagation of harmful content. In response, machine unlearning has emerged as a technique aimed at selectively removing undesired knowledge from LLMs, without compromising their general utility. Yet, the effectiveness and robustness of unlearning methods remain underexplored, with existing evaluations relying largely on ad-hoc or limited metrics. This paper presents an in-depth evaluation of the "Who’s Harry Potter" (WHP) unlearning technique, utilizing a comprehensive suite of tests to assess its effectiveness and reveal its limitations.

Evaluating Unlearning Robustness

The evaluation focuses on several dimensions, including traditional metrics like retention and forgetting tests, as well as novel approaches that test the model's resilience to knowledge extraction, the impact of relearning, and unintended side effects in related domains. Our analysis uncovers several key findings:

  • Generalization of Unlearning: The WHP model demonstrates a consistent reduction in familiarity with Harry Potter content, suggesting successful unlearning. However, the measure of familiarity employed may overly favor the specific unlearning method used, raising questions about the metric's general applicability.
  • Knowledge Extraction: Despite the unlearning, higher-than-baseline levels of knowledge about Harry Potter can still be extracted from the WHP model using techniques such as jailbreak prompts and in-context relearning, indicating that the model retains latent knowledge that adversarial querying can surface (see the sketch after this list).
  • Performance on Downstream Tasks: The WHP model's performance on trivia-based evaluations and Q&A tasks related to Harry Potter content remains nearly on par with the original model, suggesting that substantial knowledge about the domain persists post-unlearning.
  • Latent Knowledge and Side Effects: Analysis of latent knowledge via supervised and unsupervised probing techniques reveals comparable levels of retained information between the WHP and original models. Additionally, the WHP model exhibits collateral unlearning effects in domains related to Harry Potter, indicating unintended consequences of the unlearning process.
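
To illustrate the knowledge-extraction finding above, the minimal sketch below compares the WHP model's answer to a Harry Potter question with and without a short in-context "relearning" excerpt prepended to the prompt. The Hugging Face model identifier, the excerpt, and the question are assumptions chosen for illustration, not the paper's exact prompts or setup.

```python
# Hypothetical sketch: compare completions with and without in-context "relearning".
# The model identifier below is assumed; substitute whichever WHP checkpoint is in use.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/Llama2-7b-WhoIsHarryPotter"  # assumed WHP checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def complete(prompt: str, max_new_tokens: int = 32) -> str:
    """Greedy completion of a prompt, returning only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

question = "Question: What school does Harry Potter attend?\nAnswer:"
excerpt = (
    "Excerpt: Harry Potter is a young wizard who studies magic alongside "
    "his friends Ron Weasley and Hermione Granger.\n\n"
)

# The excerpt does not contain the answer; if the correct answer appears only
# after the excerpt is prepended, related context alone revives "unlearned" knowledge.
print("no context :", complete(question))
print("with context:", complete(excerpt + question))
```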

Theoretical and Practical Implications

These findings underscore several critical challenges for the development of machine unlearning techniques in LLMs. Firstly, the persistence of latent knowledge, despite targeted unlearning efforts, highlights the complex nature of knowledge representation in neural networks and the difficulty of ensuring complete knowledge removal. Secondly, the unintended collateral unlearning in related domains raises concerns about the specificity and control of unlearning interventions, which must be addressed to avoid compromising the model's utility in other contexts.
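
To make the latent-knowledge point concrete, here is a minimal sketch of a supervised linear probe over hidden states. The checkpoint identifier, the probed layer, and the true/false statements are illustrative placeholders rather than the paper's probing protocol.

```python
# Minimal, illustrative linear probe: if a simple classifier can separate true
# from false statements about the "unlearned" domain using hidden states, the
# model still represents that knowledge internally.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/Llama2-7b-WhoIsHarryPotter"  # assumed checkpoint under study
LAYER = 16  # arbitrary mid-depth layer, chosen for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def last_token_hidden(text: str) -> np.ndarray:
    """Hidden state of the final token at LAYER."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float().numpy()

# Toy probing set: true/false statements about the unlearned domain.
statements = [
    ("Harry Potter attends Hogwarts.", 1),
    ("Harry Potter attends a school for dragons.", 0),
    ("Hermione Granger is one of Harry's friends.", 1),
    ("Ron Weasley is Harry's archenemy.", 0),
]

X = np.stack([last_token_hidden(text) for text, _ in statements])
y = np.array([label for _, label in statements])

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe train accuracy:", probe.score(X, y))
```

A real probing study would of course use many more statements and a held-out split; this sketch only shows the mechanics of extracting activations and fitting a probe.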

Future Directions in Unlearning

The demonstrated limitations of the WHP model and its unlearning approach prompt a reevaluation of current strategies and encourage the exploration of alternative methods. Future research should aim to develop unlearning techniques that ensure more thorough knowledge removal, resist adversarial attempts to extract unlearned information, and minimize unintended side effects. Moreover, the development of standardized, comprehensive evaluation metrics is crucial to accurately assess unlearning effectiveness and compare different approaches. By addressing these challenges, we can make significant strides toward safer and more responsible AI systems.
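
One way to read the call for standardized evaluation is as a fixed report format covering the axes examined in this paper. The sketch below is purely hypothetical; the field names and values are invented for illustration and are not a proposed standard.

```python
# Hypothetical report card bundling the evaluation axes discussed in this paper.
# Field names and values are illustrative placeholders, not a proposed standard.
from dataclasses import dataclass, asdict
import json

@dataclass
class UnlearningReport:
    familiarity_drop: float          # reduction in a Familiarity-style score
    qa_accuracy_gap: float           # original-minus-unlearned accuracy on domain Q&A
    extraction_success_rate: float   # fraction of adversarial prompts recovering content
    probe_accuracy: float            # latent-knowledge probe accuracy on hidden states
    collateral_forgetting: float     # performance drop on related, non-target domains

example = UnlearningReport(
    familiarity_drop=0.42,
    qa_accuracy_gap=0.05,
    extraction_success_rate=0.31,
    probe_accuracy=0.88,
    collateral_forgetting=0.12,
)
print(json.dumps(asdict(example), indent=2))
```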

Conclusion

This evaluation of the WHP model's unlearning technique reveals critical insights into the current state of machine unlearning in LLMs. While the WHP model demonstrates some degree of success in forgetting targeted content, significant challenges remain in ensuring the complete and specific removal of undesired knowledge. By highlighting these issues and proposing directions for future research, this work contributes to the ongoing efforts to align LLM capabilities with ethical and social standards, ensuring their safe and beneficial application across various domains.

References (68)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022.
  3. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp.  141–159. IEEE, 2021.
  4. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.
  5. Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security and privacy, pp.  463–480. IEEE, 2015.
  6. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.
  7. Unlearn what you want to forget: Efficient unlearning for llms. arXiv preprint arXiv:2310.20150, 2023.
  8. Continual pre-training mitigates forgetting in language and vision. arXiv preprint arXiv:2205.09357, 2022.
  9. Who’s harry potter? approximate unlearning in llms. ArXiv, abs/2310.02238, 2023. URL https://api.semanticscholar.org/CorpusID:263608437.
  10. Coercing llms to do and reveal (almost) anything, 2024.
  11. Corrective machine unlearning, 2024.
  12. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019.
  13. Language models represent space and time. arXiv preprint arXiv:2310.02207, 2023.
  14. Self-destructing models: Increasing the costs of harmful dual uses of foundation models. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pp.  287–296, 2023.
  15. Lora: Low-rank adaptation of large language models. ArXiv, abs/2106.09685, 2021. URL https://api.semanticscholar.org/CorpusID:235458009.
  16. Sleeper agents: Training deceptive llms that persist through safety training. arXiv preprint arXiv:2401.05566, 2024.
  17. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.
  18. Knowledge sanitization of large language models. arXiv preprint arXiv:2309.11852, 2023.
  19. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. arXiv preprint arXiv:2311.12786, 2023.
  20. Knowledge unlearning for mitigating privacy risks in language models. arXiv preprint arXiv:2210.01504, 2022.
  21. Linear connectivity reveals generalization strategies. arXiv preprint arXiv:2205.12411, 2022.
  22. Copyright violations and large language models. arXiv preprint arXiv:2310.13771, 2023.
  23. Understanding catastrophic forgetting in language models via implicit inference. arXiv preprint arXiv:2309.10105, 2023.
  24. Privacy adhering machine un-learning in nlp. arXiv preprint arXiv:2212.09573, 2022.
  25. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity. arXiv preprint arXiv:2401.01967, 2024.
  26. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. arXiv preprint arXiv:2310.20624, 2023.
  27. Technical report for iccv 2021 challenge sslad-track3b: Transformers are better continual learners. arXiv preprint arXiv:2201.04924, 2022.
  28. Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness? arXiv preprint arXiv:2312.03729, 2023a.
  29. Rethinking machine unlearning for large language models, 2024a.
  30. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023b.
  31. Towards safer large language models through machine unlearning, 2024b.
  32. Large language models relearn removed concepts. arXiv preprint arXiv:2401.01814, 2024.
  33. Investigating bias representations in llama 2 chat via activation steering, 2024.
  34. Quark: Controllable text generation with reinforced unlearning. Advances in neural information processing systems, 35:27591–27609, 2022.
  35. Mechanistic mode connectivity. In International Conference on Machine Learning, pp.  22965–23004. PMLR, 2023.
  36. Investigating forgetting in pre-trained representations through continual learning. arXiv preprint arXiv:2305.05968, 2023.
  37. Tofu: A task of fictitious unlearning for llms. arXiv preprint arXiv:2401.06121, 2024.
  38. A survey of machine unlearning. arXiv preprint arXiv:2209.02299, 2022.
  39. Can sensitive information be deleted from llms? objectives for defending against extraction attacks. arXiv preprint arXiv:2309.17410, 2023.
  40. In-context unlearning: Language models as few shot unlearners. arXiv preprint arXiv:2310.07579, 2023.
  41. Fine-tuning enhances existing mechanisms: A case study on entity tracking, 2024.
  42. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023.
  43. Effect of scale on catastrophic forgetting in neural networks. In International Conference on Learning Representations, 2021.
  44. Tricking llms into disobedience: Understanding, analyzing, and preventing jailbreaks. arXiv preprint arXiv:2305.14965, 2023.
  45. Steering llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023.
  46. J.K. Rowling. Harry potter series. Bloomsbury Publishing (UK), Scholastic Press (US), 1997-2007. Series includes: Harry Potter and the Sorcerer’s Stone (1997), Harry Potter and the Chamber of Secrets (1998), Harry Potter and the Prisoner of Azkaban (1999), Harry Potter and the Goblet of Fire (2000), Harry Potter and the Order of the Phoenix (2003), Harry Potter and the Half-Blood Prince (2005), and Harry Potter and the Deathly Hallows (2007).
  47. Soft prompt threats: Attacking safety alignment and unlearning in open-source llms through the embedding space, 2024.
  48. Fine-tuned language models are continual learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  6107–6122, 2022.
  49. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023.
  50. Exploring the landscape of machine unlearning: A comprehensive survey and taxonomy. arXiv preprint arXiv:2305.06360, 2023.
  51. Survey of vulnerabilities in large language models revealed by adversarial attacks. arXiv preprint arXiv:2310.10844, 2023.
  52. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
  53. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023.
  54. Knowledge unlearning for llms: Tasks, methods, and challenges. arXiv preprint arXiv:2311.15766, 2023.
  55. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  56. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.
  57. A language model’s guide through latent space, 2024.
  58. Kga: A general machine unlearning framework based on knowledge gap alignment. arXiv preprint arXiv:2305.06535, 2023.
  59. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
  60. Assessing the brittleness of safety alignment via pruning and low-rank modifications, 2024.
  61. Depn: Detecting and editing privacy neurons in pretrained language models. arXiv preprint arXiv:2310.20138, 2023.
  62. Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949, 2023.
  63. Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023.
  64. Unlearning bias in language models by partitioning gradients. In Findings of the Association for Computational Linguistics: ACL 2023, pp.  6032–6048, 2023.
  65. Removing rlhf protections in gpt-4 via fine-tuning. arXiv preprint arXiv:2311.05553, 2023.
  66. Composing parameter-efficient modules with arithmetic operations. arXiv preprint arXiv:2306.14870, 2023.
  67. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023a.
  68. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023b.
Authors (5)
  1. Aengus Lynch (8 papers)
  2. Phillip Guo (5 papers)
  3. Aidan Ewart (5 papers)
  4. Stephen Casper (40 papers)
  5. Dylan Hadfield-Menell (54 papers)
Citations (36)