Unforgettable Generalization in Language Models (2409.02228v1)
Abstract: When language models (LMs) are trained to forget (or "unlearn") a skill, how precisely does their behavior change? We study the behavior of transformer LMs in which tasks have been forgotten via fine-tuning on randomized labels. Such LMs learn to generate near-random predictions for individual examples in the "training" set used for forgetting. Across tasks, however, LMs exhibit extreme variability in whether LM predictions change on examples outside the training set. In some tasks (like entailment classification), forgetting generalizes robustly and causes models to produce uninformative predictions on new task instances; in other tasks (like physical commonsense reasoning and scientific question answering), forgetting affects only the training examples, and models continue to perform the "forgotten" task accurately even for examples very similar to those that appeared in the training set. Dataset difficulty is not predictive of whether a behavior can be forgotten; instead, generalization in forgetting is (weakly) predicted by the confidence of LMs' initial task predictions and the variability of LM representations of training data, with low confidence and low variability both associated with greater generalization. Perhaps most surprisingly, random-label forgetting appears to be somewhat insensitive to the contents of the training set: for example, models trained on science questions with random labels continue to answer other science questions accurately, but begin to produce random labels on entailment classification tasks. Finally, we show that even generalizable forgetting is shallow: linear probes trained on LMs' representations can still perform tasks reliably after forgetting. Our results highlight the difficulty and unpredictability of performing targeted skill removal from models via fine-tuning.
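The forgetting procedure studied here is plain fine-tuning on uniformly random labels. Below is a minimal sketch of that setup, not the paper's exact implementation: it assumes a HuggingFace causal LM, a two-choice task whose answers are single label tokens, and a hypothetical `forget_examples` list of prompts serving as the forgetting "training" set.

```python
# Minimal sketch of random-label forgetting (assumptions noted above; this is
# not the paper's exact training code). A small causal LM stands in for the
# larger transformer LMs studied in the paper.
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper's experiments use larger LMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

LABEL_TOKENS = [" yes", " no"]  # assumed single-token verbalizers for a binary task
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def forgetting_step(prompt: str) -> float:
    """Take one gradient step toward a uniformly random label for this prompt."""
    random_label = random.choice(LABEL_TOKENS)  # key idea: the label is randomized
    inputs = tokenizer(prompt + random_label, return_tensors="pt")
    labels = inputs.input_ids.clone()
    labels[:, :-1] = -100  # supervise only the final (label) token
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# `forget_examples` is a hypothetical list of task prompts used as the
# forgetting set; generalization is then measured on held-out prompts.
# for prompt in forget_examples:
#     forgetting_step(prompt)
```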
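The probing result (forgetting is "shallow") can likewise be checked with a simple linear probe over hidden states. This is a sketch under assumed names: `train_prompts`/`train_labels` and `test_prompts`/`test_labels` are hypothetical held-out task examples paired with their true labels, and the choice of layer and scikit-learn classifier is illustrative.

```python
# Minimal sketch of the linear-probe check: after forgetting, fit a linear
# classifier on the (frozen) model's hidden states using true labels. High
# probe accuracy indicates the task is still linearly decodable.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

model.eval()  # reuse the forgotten `model`/`tokenizer` from the sketch above

@torch.no_grad()
def last_token_features(prompts, layer: int = -1) -> np.ndarray:
    """Hidden state of the final prompt token at a chosen layer (here: last)."""
    feats = []
    for p in prompts:
        out = model(**tokenizer(p, return_tensors="pt"), output_hidden_states=True)
        feats.append(out.hidden_states[layer][0, -1].float().numpy())
    return np.stack(feats)

# `train_prompts`/`train_labels` and `test_prompts`/`test_labels` are
# hypothetical held-out task examples paired with their *true* labels.
# probe = LogisticRegression(max_iter=1000).fit(
#     last_token_features(train_prompts), train_labels)
# print("probe accuracy:", probe.score(last_token_features(test_prompts), test_labels))
```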
Authors: Eric Zhang, Leshem Choshen, Jacob Andreas