Towards Robust and Cost-Efficient Knowledge Unlearning for Large Language Models (2408.06621v2)

Published 13 Aug 2024 in cs.LG and cs.CL

Abstract: LLMs have demonstrated strong reasoning and memorization capabilities via pretraining on massive textual corpora. However, this poses risks of privacy and copyright violations, highlighting the need for efficient machine unlearning methods that remove sensitive data without retraining from scratch. While Gradient Ascent (GA) is commonly used to unlearn by reducing the likelihood of generating unwanted content, it leads to unstable optimization and catastrophic forgetting of retained knowledge. We also find that combining GA with low-rank adaptation results in poor trade-offs between computational cost and generative performance. To address these challenges, we propose two novel techniques for robust and efficient unlearning for LLMs. First, we introduce Inverted Hinge loss, which suppresses unwanted tokens while maintaining fluency by boosting the probability of the next most likely token. Second, we develop a data-adaptive initialization for LoRA adapters via low-rank approximation weighted with relative Fisher information, thereby focusing updates on parameters critical for removing targeted knowledge. Experiments on the Training Data Extraction Challenge dataset using GPT-Neo models as well as on the TOFU benchmark with Phi-1.5B and Llama2-7B models demonstrate that our approach effectively removes sensitive information while maintaining reasoning and generative capabilities with minimal impact.

Towards Robust and Cost-Efficient Knowledge Unlearning for LLMs

The paper "Towards Robust and Cost-Efficient Knowledge Unlearning for LLMs" by Sungmin Cha, Sungjun Cho, Dasol Hwang, and Moontae Lee introduces two innovative techniques aimed at addressing privacy and copyright challenges in the training of LLMs via an efficient knowledge unlearning framework.

Summary

The paper begins by motivating machine unlearning as a remedy for the privacy and copyright issues that stem from the indiscriminate memorization abilities of LLMs. It then critiques the conventional use of Gradient Ascent (GA) for knowledge unlearning, showing that GA leads to unstable optimization and catastrophic forgetting because maximizing the cross-entropy loss is unbounded. Moreover, combining GA with parameter-efficient methods such as LoRA (Low-Rank Adaptation) is found to yield a poor trade-off between computational cost and generative performance.
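To make the instability concrete, the standard GA unlearning objective simply negates the next-token cross-entropy on the forget set; the notation below is the usual formulation of this objective rather than a verbatim reproduction from the paper.

```latex
% Gradient ascent on the forget set D_f: maximize cross-entropy, i.e.,
% minimize the sum of log-probabilities, which diverges to -infinity
% as p_theta(x_t | x_{<t}) -> 0 -- hence the unstable optimization.
\mathcal{L}_{\mathrm{GA}}(\theta)
  \;=\; -\,\mathcal{L}_{\mathrm{CE}}(\theta)
  \;=\; \sum_{x \in \mathcal{D}_f} \sum_{t} \log p_\theta\!\left(x_t \mid x_{<t}\right)
```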

Proposed Techniques

The authors propose two novel approaches:

  1. Inverted Hinge Loss (IHL): This new loss function is designed to improve the efficiency and stability of unlearning by suppressing the probability of generating unwanted tokens while preserving the structure and fluency of the language. The paper provides a thorough gradient analysis indicating that IHL avoids the issues inherent to GA by concentrating gradient updates so as to reduce unnecessary forgetting while keeping the loss bounded (a minimal sketch of such a loss appears after this list).
  2. Fisher-weighted Low-Rank Approximation for LoRA (FLoRA): This technique initializes LoRA adapters from a low-rank approximation of the pretrained weights weighted by relative Fisher information. By focusing updates on parameters pertinent to the data to be forgotten, FLoRA enables faster unlearning and better retention of crucial model knowledge: the initialization highlights the parameters most responsible for the targeted information, improving the computational cost versus performance trade-off (an illustrative initialization sketch also appears below).
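The following is a minimal PyTorch-style sketch of an inverted-hinge loss consistent with the description in item 1: per token it penalizes 1 + p(target) - max over non-target tokens of p(v), so minimizing it suppresses the unwanted token while boosting the runner-up. The paper's exact reduction, padding masks, and any additional weighting may differ.

```python
import torch
import torch.nn.functional as F

def inverted_hinge_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Inverted-hinge-style unlearning loss (illustrative sketch).

    logits:  (batch, seq_len, vocab) next-token logits from the model.
    targets: (batch, seq_len) token ids of the sequence to be unlearned.

    Per token the loss is 1 + p(target) - max_{v != target} p(v), which is
    bounded in [0, 2], unlike gradient ascent on the cross-entropy.
    """
    probs = F.softmax(logits, dim=-1)                               # (B, T, V)
    p_target = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)

    # Mask out the target token, then take the best competing probability.
    masked = probs.scatter(-1, targets.unsqueeze(-1), float("-inf"))
    p_runner_up = masked.max(dim=-1).values                         # (B, T)

    return (1.0 + p_target - p_runner_up).mean()
```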

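To make the FLoRA idea concrete, the sketch below shows one plausible way to initialize LoRA factors from a relative-Fisher-weighted low-rank approximation of a pretrained weight matrix. All function and variable names are chosen here for illustration, and details such as how the Fisher weights enter the decomposition and which factor is zero-initialized are assumptions rather than the paper's exact recipe.

```python
import torch

def fisher_weighted_lora_init(W: torch.Tensor,
                              fisher_forget: torch.Tensor,
                              fisher_retain: torch.Tensor,
                              rank: int,
                              eps: float = 1e-8):
    """Data-adaptive LoRA initialization via Fisher-weighted low-rank
    approximation (illustrative sketch, not the paper's exact recipe).

    W:             (out, in) pretrained weight matrix.
    fisher_forget: (out, in) diagonal Fisher estimate (mean squared grads)
                   computed on the forget set.
    fisher_retain: (out, in) diagonal Fisher estimate on the retain set.
    rank:          LoRA rank r.

    Returns LoRA factors (B, A) with delta_W = B @ A.
    """
    # Relative importance: parameters that matter for the forget data
    # but not for the retained data receive large weights.
    rel_fisher = fisher_forget / (fisher_retain + eps)

    # Element-wise weighting followed by a truncated SVD, used here as a
    # simple proxy for a weighted low-rank approximation of W.
    U, S, Vh = torch.linalg.svd(rel_fisher.sqrt() * W, full_matrices=False)
    A = Vh[:rank]                                      # (r, in): forget-relevant input directions
    B = torch.zeros(W.shape[0], rank, dtype=W.dtype)   # start with delta_W = 0

    return B, A
```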
Experimental Evaluation

The experiments were conducted across various model sizes (GPT-Neo 125M, 1.3B, and 2.7B) on the Training Data Extraction Challenge (TDEC) dataset, with additional evaluations on the TOFU benchmark using Phi-1.5B and Llama2-7B models. The authors measured unlearning efficacy using Extraction Likelihood (EL) and Memorization Accuracy (MA), and verified that general capabilities were preserved by evaluating downstream tasks, including classification and dialogue generation.
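For reference, Memorization Accuracy is commonly computed in prior knowledge-unlearning work as the fraction of next tokens that greedy decoding reproduces on a forget-set sequence; the sketch below assumes a Hugging Face-style causal LM interface and uses names chosen here for illustration.

```python
import torch

@torch.no_grad()
def memorization_accuracy(model, token_ids: torch.Tensor) -> float:
    """Fraction of next tokens predicted correctly by greedy decoding on
    one forget-set sequence (sketch of the MA metric; batching and padding
    details omitted).

    token_ids: (1, T) tensor of token ids for a single sequence.
    """
    logits = model(token_ids).logits          # (1, T, vocab)
    preds = logits[:, :-1].argmax(dim=-1)     # position t predicts token t+1
    targets = token_ids[:, 1:]
    return (preds == targets).float().mean().item()
```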

Findings:

  • Effectiveness of IHL: The results demonstrated that IHL effectively mitigated the issues associated with GA, providing more stable and rapid unlearning.
  • Enhanced Performance with FLoRA: Applying FLoRA improved the trade-off between computational cost and performance, achieving successful unlearning without significant degradation of the model's reasoning and generative capabilities.
  • Comparison with Baselines: Both IHL and FLoRA consistently outperformed traditional GA and vanilla LoRA implementations in terms of both unlearning success and the preservation of downstream task performance.

Implications

The robustness and efficiency achieved by these methods carry significant practical and theoretical implications for future LLM deployment. From a practical standpoint, the framework enables compliant de-identification in real-world applications where data privacy is paramount. Theoretically, it lays groundwork for further exploration of differential privacy and model interpretability within NLP.

Future Directions

  • Scalability to Other Architectures: Investigating the scalability of IHL and FLoRA across different model architectures such as BERT or T5.
  • Broader Evaluation Metrics: Expanding the set of evaluation metrics to include aspects such as fairness and bias to ensure holistic unlearning.
  • Integration with Differential Privacy: Exploring how these methods can be integrated with differential privacy techniques to provide guarantees on the bounds of knowledge retention and forgetting.

In conclusion, the paper contributes valuable insights and methodologies to the growing field of machine unlearning, specifically within the context of LLMs. Its findings are a notable step towards resolving pressing privacy and computational efficiency challenges in contemporary AI systems.

Authors (4)
  1. Sungmin Cha (26 papers)
  2. Sungjun Cho (18 papers)
  3. Dasol Hwang (8 papers)
  4. Moontae Lee (54 papers)