Not All Tokens Are Meant to Be Forgotten
(2506.03142v1)
Published 3 Jun 2025 in cs.LG
Abstract: LLMs, pre-trained on massive text corpora, exhibit remarkable human-level language understanding, reasoning, and decision-making abilities. However, they tend to memorize unwanted information, such as private or copyrighted content, raising significant privacy and legal concerns. Unlearning has emerged as a promising solution, but existing methods face a significant challenge of over-forgetting. This issue arises because they indiscriminately suppress the generation of all the tokens in forget samples, leading to a substantial loss of model utility. To overcome this challenge, we introduce the Targeted Information Forgetting (TIF) framework, which consists of (1) a flexible targeted information identifier designed to differentiate between unwanted words (UW) and general words (GW) in the forget samples, and (2) a novel Targeted Preference Optimization approach that leverages Logit Preference Loss to unlearn unwanted information associated with UW and Preservation Loss to retain general information in GW, effectively improving the unlearning process while mitigating utility degradation. Extensive experiments on the TOFU and MUSE benchmarks demonstrate that the proposed TIF framework enhances unlearning effectiveness while preserving model utility and achieving state-of-the-art results.
This paper, "Not All Tokens Are Meant to Be Forgotten" (Zhou et al., 3 Jun 2025), addresses the significant challenge of over-forgetting in LLM unlearning. Existing unlearning methods often indiscriminately suppress the generation of all tokens within forget samples, which leads to a substantial loss of model utility. To counteract this, the authors propose the Targeted Information Forgetting (TIF) framework, designed to differentiate between unwanted information (associated with Unwanted Words, UW) and general information (associated with General Words, GW) within forget samples and selectively unlearn only the unwanted parts.
The TIF framework consists of two main components:
Targeted Information Identifier: This component is responsible for distinguishing UW from GW in the forget samples. The paper explores two practical approaches for this identification:
Discriminative Encoder-Only LM: Using a masked language model such as DistilBERT. Each word $w_i$ in the forget response $y_f$ is replaced with a [MASK] token, and the model predicts the masked token given the context $x_f$ and the masked sequence. If the prediction matches the original word $w_i$, the word is classified as a GW; otherwise, it is a UW (a minimal sketch of this masking procedure appears after the two approaches). This approach is highlighted as computationally efficient and scalable to large datasets.
Generative Decoder-Only LM: Utilizing powerful generative models like ChatGPT-4. This approach leverages the model's semantic understanding to directly identify UW and GW based on task-specific instructions, as detailed in Appendix A. This method is found to be more effective in balancing forget quality and model utility but is limited by the context window size of the generative model, making it more suitable for smaller datasets like TOFU compared to larger ones like MUSE.
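As a concrete illustration of the discriminative identifier described above, here is a minimal sketch assuming a Hugging Face DistilBERT masked-LM checkpoint; the function name, word-level masking, and case-insensitive exact-match rule are illustrative assumptions rather than the authors' implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
model.eval()

def identify_unwanted_words(context: str, answer_words: list[str]) -> list[bool]:
    """Return one flag per answer word: True = unwanted word (UW), False = general word (GW)."""
    flags = []
    for i, word in enumerate(answer_words):
        # Mask the i-th answer word while keeping the question and the rest of the answer.
        masked = answer_words.copy()
        masked[i] = tokenizer.mask_token
        text = context + " " + " ".join(masked)
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1][0]
        with torch.no_grad():
            logits = model(**inputs).logits
        predicted = tokenizer.decode([logits[0, mask_pos].argmax().item()]).strip()
        # If the MLM recovers the word from context alone, treat it as general (GW);
        # if the word is unpredictable from context, treat it as unwanted (UW).
        flags.append(predicted.lower() != word.lower())
    return flags
```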
Targeted Preference Optimization (TPO): This is a novel optimization approach designed to replace standard negative preference optimization (NPO) for targeted unlearning. The motivation stems from the observation that NPO, even when combined with an information identifier (NPO-GPT), still suffers significant utility degradation, particularly with larger forget sets. This is attributed to NPO's tendency to alter the logits of non-target tokens when reducing the probability of target tokens, thus distorting the overall logit distribution and inadvertently affecting general information. TPO addresses this with two key loss terms:
Preservation Loss (PL): A cross-entropy loss applied to the GW part $\bar{y}$ of the forget sample $\xi_f = (x_f, y_f)$. This explicitly re-trains the model on GW, preventing the forgetting of general information. The objective is to minimize $-\log P_{\theta}(\bar{y} \mid x_f)$.
Logit Preference Loss (LPL): Applied to the UW part $\hat{y}$. Instead of directly reducing probabilities via the softmax, LPL selectively reduces the logits of UW by enforcing a preference between the unlearned model $\mathcal{M}_{\theta}$ and the reference model $\mathcal{M}_{\text{ref}}$, maximizing the margin between the reference model's logits and the unlearned model's logits for UW. The loss is $-\frac{2}{\beta}\log\sigma\big(\beta\,(z_{\text{ref}}(\hat{y} \mid x_f) - z_{\theta}(\hat{y} \mid x_f))\big)$, where $z_{\theta}(\hat{y} \mid x_f)$ denotes the unlearned model's logits for the UW tokens.
The full TPO objective combines these two losses: $\mathbb{E}_{\xi_f\sim D_f}\Big[ \underbrace{-\tfrac{2}{\beta}\log\sigma\big(\beta\,(z_{\text{ref}}(\hat{y}\mid x_f) - z_{\theta}(\hat{y}\mid x_f))\big)}_{\text{LPL}} \; \underbrace{-\; \lambda \log P_{\theta}(\bar{y} \mid x_f)}_{\text{PL}} \Big]$, where $\lambda$ is a tuning weight for the Preservation Loss.
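To make the objective concrete, below is a minimal sketch of the TPO loss under stated assumptions: `logits` and `ref_logits` are (seq_len, vocab) tensors from the unlearned and reference models over the forget answer, `labels` are the answer token ids, and `uw_mask` is a float mask over positions marked as UW by the identifier; the function name and the mean-over-positions reduction are illustrative, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def tpo_loss(logits, ref_logits, labels, uw_mask, beta: float = 0.1, lam: float = 1.0):
    # Logit of the ground-truth token at each answer position, for both models.
    z_theta = logits.gather(-1, labels.unsqueeze(-1)).squeeze(-1)    # (seq_len,)
    z_ref = ref_logits.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (seq_len,)

    # Logit Preference Loss (LPL) on UW positions: push the unlearned model's
    # logits for unwanted tokens below those of the reference model.
    lpl = -(2.0 / beta) * F.logsigmoid(beta * (z_ref - z_theta))
    lpl = (lpl * uw_mask).sum() / uw_mask.sum().clamp(min=1)

    # Preservation Loss (PL) on GW positions: standard cross-entropy, so the
    # general information in the forget sample is retained.
    gw_mask = 1.0 - uw_mask
    ce = F.cross_entropy(logits, labels, reduction="none")           # (seq_len,)
    pl = (ce * gw_mask).sum() / gw_mask.sum().clamp(min=1)

    return lpl + lam * pl
```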
The framework was evaluated on two standard LLM unlearning benchmarks:
MUSE: Focused on unlearning copyrighted content (Harry Potter books, news articles). Metrics included VerbMem, KnowMem (ROUGE-L F1), and PrivLeak (Min-K% Prob). Experiments used ICLM-7B and LLaMA-2 7B. Due to the large data size, the discriminative (DistilBERT) identifier was used. The results show that TPO, particularly when combined with the Gradient Descent on Retain (GDR) loss (TPO-GDR), consistently achieves PrivLeak values closest to 0, indicating behavior closest to the retained model compared to baselines such as GA, NPO, and SimNPO. GDR is shown to significantly improve utility preservation on large datasets like MUSE.
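For reference, the Min-K% Prob score that PrivLeak builds on can be sketched as follows, assuming the per-token log-probabilities of a candidate sequence are already available; the aggregation of these scores over forget and holdout sets into the final PrivLeak value is omitted, and the function name and default k are illustrative.

```python
import numpy as np

def min_k_percent_prob(token_log_probs: np.ndarray, k: float = 0.2) -> float:
    # Average the log-probabilities of the k% least likely tokens; sequences the
    # model has memorized tend to have few very unlikely tokens, so they score higher.
    n = max(1, int(len(token_log_probs) * k))
    return float(np.sort(token_log_probs)[:n].mean())
```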
TOFU: A synthetic dataset for fictitious unlearning (author biographies). Metrics included Forget Quality (KS-test p-value on Truth Ratios over $D_f$) and Model Utility (the mean of various metrics on retain and real-world knowledge sets). Experiments used LLaMA-2 7B and LLaMA-3.2 3B. The generative (GPT) identifier was used given the smaller dataset size. Results demonstrate that incorporating the unwanted information identifier significantly boosts the performance of all baselines. TPO-GPT consistently achieves high model utility while maintaining strong forget quality, showing the best trade-off, especially at the largest forget set size (10%).
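As an illustration of the forget-quality metric described above, a minimal sketch using SciPy's two-sample KS test is shown below; the input arrays of per-sample truth ratios are assumed to be precomputed, and the variable names are illustrative.

```python
from scipy.stats import ks_2samp

def forget_quality(truth_ratios_unlearned, truth_ratios_retain) -> float:
    # Two-sample KS test: a high p-value indicates that the unlearned model's
    # truth-ratio distribution on the forget set is statistically indistinguishable
    # from that of a model trained only on the retain set.
    return ks_2samp(truth_ratios_unlearned, truth_ratios_retain).pvalue
```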
The paper highlights that unwanted information identification is crucial for effective unlearning, and generative LMs currently offer better identification performance than discriminative ones for this task. TPO, with its targeted LPL on UW and PL on GW, effectively mitigates the over-forgetting problem faced by previous preference optimization methods like NPO, leading to state-of-the-art performance in balancing unlearning effectiveness and model utility preservation, particularly under challenging conditions with larger forget sets.
The authors acknowledge limitations, noting that TIF's reliance on identifier accuracy may pose challenges for unlearning conceptually diffuse knowledge (as in the WMDP benchmark), where the target information is not easily localized to specific tokens. The current framework focuses on sequence unlearning.