
Unlearning That Lasts: Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs

Published 2 Sep 2025 in cs.LG (arXiv:2509.02820v1)

Abstract: Unlearning in LLMs involves precisely removing specific information from a pre-trained model. This is crucial to ensure safety of LLMs by deleting private data or harmful knowledge acquired during pre-training. However, existing unlearning methods often fall short when subjected to thorough evaluation. To overcome this, we introduce JensUn, where we leverage the Jensen-Shannon Divergence as the training objective for both forget and retain sets for more stable and effective unlearning dynamics compared to commonly used loss functions. In extensive experiments, JensUn achieves better forget-utility trade-off than competing methods, and even demonstrates strong resilience to benign relearning. Additionally, for a precise unlearning evaluation, we introduce LKF, a curated dataset of lesser-known facts that provides a realistic unlearning scenario. Finally, to comprehensively test unlearning methods, we propose (i) employing an LLM as semantic judge instead of the standard ROUGE score, and (ii) using worst-case unlearning evaluation over various paraphrases and input formats. Our improved evaluation framework reveals that many existing methods are less effective than previously thought.

Summary

  • The paper presents JensUn, a method leveraging Jensen-Shannon Divergence to achieve utility-preserving and robust unlearning of LLMs under rigorous evaluation.
  • It introduces a novel evaluation framework using LLM-based semantic judges and worst-case paraphrase assessments to accurately measure forgetting.
  • Empirical results show that JensUn effectively eliminates targeted knowledge while maintaining high performance on retain sets and resisting benign relearning.

Utility-Preserving, Robust, and Almost Irreversible Unlearning in LLMs: The JensUn Approach

Introduction and Motivation

The proliferation of LLMs has intensified the need for effective machine unlearning—removing specific knowledge from a model without retraining from scratch. This is critical for privacy, copyright compliance, and safety, especially when models inadvertently memorize sensitive or harmful information. However, existing unlearning methods often fail under rigorous evaluation, either leaving residual knowledge or degrading model utility. The paper introduces JensUn, a method leveraging Jensen-Shannon Divergence (JSD) as the core loss for both forget and retain sets, aiming to achieve stable, utility-preserving, and robust unlearning. The work also proposes a comprehensive evaluation framework, including a new dataset (LKF), semantic LLM-based judges, and worst-case paraphrase-based assessments.

The JensUn Method: JSD-Based Unlearning

JensUn formulates unlearning as a multi-objective optimization over a forget set $\mathcal{D}_F$ and a retain set $\mathcal{D}_R$:

$$\mathcal{L}_{\text{unlearning}}(\theta) = \lambda_\mathcal{F}\, L_\mathcal{F}(\theta, \mathcal{D}_F) + \lambda_\mathcal{R}\, L_\mathcal{R}(\theta, \mathcal{D}_R)$$

  • Forget Loss ($L_\mathcal{F}$): Minimize the JSD between the model output and a fixed target (e.g., a refusal string like "No idea" or a non-informative token sequence) for each token in the forget set.
  • Retain Loss ($L_\mathcal{R}$): Minimize the JSD between the unlearned model's output and the base model's output on the retain set, preserving general capabilities.
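To make the objective concrete, here is a minimal NumPy sketch of the token-wise Jensen-Shannon divergence between two vocabulary distributions given as logits. This is an illustrative reconstruction, not the paper's implementation, which operates on model outputs inside a fine-tuning pipeline; the function names are ours.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, axis=-1):
    """KL divergence KL(p || q) between probability vectors."""
    eps = 1e-12
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=axis)

def jsd(logits_p, logits_q):
    """Jensen-Shannon divergence between two distributions given as
    logits. Symmetric and bounded in [0, ln 2], which is what gives
    the loss its stable gradients."""
    p, q = softmax(logits_p), softmax(logits_q)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

For the forget loss, `logits_q` would encode the fixed target (e.g., the refusal string's tokens); for the retain loss, it would be the frozen base model's logits on the same input. The bound of $\ln 2$ is reached only when the two distributions have disjoint support.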

The JSD is symmetric and bounded, providing stable gradients and preventing catastrophic forgetting or utility collapse. This enables longer, more stable fine-tuning compared to unbounded losses (e.g., cross-entropy or KL-divergence), which can destabilize the model.

Figure 1: JensUn yields the best trade-off between unlearning quality (forget set accuracy) and utility of the LLM.

Implementation Details

  • Target Selection: The target for the forget loss can be a refusal string, whitespace, or random tokens. Empirically, all yield similar unlearning efficacy, allowing for application-specific customization.
  • Optimization: Standard AdamW optimizer with cosine learning rate scheduling is used. Hyperparameters $\lambda_\mathcal{F}$ and $\lambda_\mathcal{R}$ are tuned to balance forgetting and utility.
  • Retain Set Construction: For QA-based datasets, the retain set should be semantically related but disjoint from the forget set to avoid leakage.
  • Training Regime: Fine-tuning is performed for 10–60 epochs, with or without paraphrased queries, depending on the dataset.
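The scheduling and weighting choices above can be sketched in a few lines; `lr_max`, `lam_f`, and `lam_r` are illustrative names, not the paper's reported hyperparameter values.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate from lr_max down to lr_min,
    as commonly paired with AdamW."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

def unlearning_loss(loss_forget, loss_retain, lam_f=1.0, lam_r=1.0):
    """Weighted combination of the forget and retain JSD terms,
    matching the multi-objective formulation above."""
    return lam_f * loss_forget + lam_r * loss_retain
```

At each optimizer step, both loss terms would be computed on a forget batch and a retain batch respectively, combined with the fixed weights, and backpropagated through the model being unlearned (the base model stays frozen).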

Evaluation Framework: Beyond ROUGE

The paper identifies critical flaws in standard unlearning evaluation protocols, particularly the use of ROUGE-L, which is insensitive to semantic correctness and paraphrasing.

Figure 2: ROUGE-L can misjudge unlearning success; LLM-based semantic judges align better with human evaluation.

Key Evaluation Innovations

  • LLM-Judge: An LLM (e.g., Gemini-2.5-Flash) is used as a semantic judge, providing binary correctness labels for model outputs. This approach shows near-optimal agreement with human raters and avoids the pitfalls of surface-level metrics.
  • Worst-Case Paraphrase Evaluation: For each forget query, multiple paraphrases are generated using diverse LLMs. Forgetting is only considered successful if the model fails on all paraphrases, capturing robustness to prompt variation.
  • In-Context Retain Augmentation: Retain set queries are added as in-context examples to further stress-test the model's ability to forget.
  • New Dataset (LKF): The Lesser-Known Facts dataset targets non-trivial, real-world facts, avoiding the limitations of fictional or dichotomous QA benchmarks.

    Figure 3: JensUn forms the Pareto front in forget-utility trade-off for different utility measures on the LKF dataset.
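The worst-case aggregation can be illustrated with a toy sketch, assuming per-paraphrase correctness labels have already been produced by the LLM judge; the function and variable names here are ours, not the paper's.

```python
def worst_case_forgotten(per_paraphrase_correct):
    """A fact counts as forgotten only if the model fails on *all*
    paraphrases and input formats (worst case over prompt variations)."""
    return not any(per_paraphrase_correct)

def forget_set_accuracy(judgments):
    """judgments: one list of booleans per fact, where True means the
    LLM judge marked the model's answer correct for that paraphrase.
    Returns the fraction of facts still answerable under at least one
    paraphrase -- lower is better for unlearning."""
    remembered = sum(any(fact) for fact in judgments)
    return remembered / len(judgments)
```

This is why worst-case evaluation reports higher forget-set accuracy than single-prompt evaluation: a model that refuses the canonical phrasing but answers one paraphrase still counts as remembering the fact.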

Empirical Results

Forget-Utility Trade-off

JensUn consistently achieves the lowest forget set accuracy (0% in LKF, 6.1% in RWKU QA) while maintaining high utility (measured by MMLU, response repetitiveness, and win rate vs. the base model). Competing methods either fail to fully forget or suffer significant utility degradation.

  • GradAscent/GradDiff: Can achieve low forget accuracy but often collapse utility, producing degenerate outputs.
  • NPO/SimNPO: Sensitive to hyperparameters; can either under-forget or degrade utility.
  • JensUn: Bounded loss ensures stable optimization, allowing for longer unlearning without utility collapse.

Robustness to Benign Relearning

A critical property for practical unlearning is resistance to benign relearning—where the model is fine-tuned on unrelated data post-unlearning. JensUn demonstrates strong resilience: after extensive relearning, forget set accuracy remains low, and utility is preserved, unlike NPO or NPO+SAM, which are vulnerable to knowledge recovery.

Figure 4: Forget set accuracy saturates after certain relearning steps; JensUn resists benign relearning more effectively than alternatives.

Evaluation Protocol Impact

Switching from ROUGE to LLM-Judge and worst-case paraphrase evaluation reveals that prior benchmarks overestimated unlearning quality by up to 43%. This exposes the inadequacy of previous protocols and underscores the necessity of robust, adversarial evaluation.

Figure 5: Worst-case evaluation over paraphrases and in-context retain samples significantly increases measured forget set accuracy, providing a more stringent assessment.


Figure 6: Diversity in paraphrase generation is crucial; worst-case over paraphrases reveals hidden residual knowledge.

Training Dynamics

JensUn's training curves show that forget set accuracy drops to zero rapidly, but optimal retain set performance is only recovered after further fine-tuning, highlighting the importance of balanced loss scheduling.

Figure 7: Training dynamics of JensUn show rapid forgetting followed by gradual recovery of retain set utility.

Practical and Theoretical Implications

  • Deployment: JensUn enables LLM providers to remove specific knowledge with minimal impact on general capabilities, supporting compliance with privacy regulations and user requests for data removal.
  • Evaluation: The proposed framework sets a new standard for unlearning assessment, emphasizing semantic correctness and adversarial robustness.
  • Scalability: The method is compatible with standard LLM fine-tuning pipelines and does not require architectural changes or retraining from scratch.
  • Limitations: The approach assumes access to a well-defined forget/retain split and may require substantial paraphrase generation for robust evaluation. Computational cost is dominated by LLM-Judge inference and paraphrase generation, but these are parallelizable.

Future Directions

  • Automated Paraphrase Generation: Integrating more diverse and adversarial paraphrasing strategies could further strengthen evaluation.
  • Unlearning at Scale: Extending JensUn to larger models and more complex knowledge structures (e.g., multi-hop reasoning, code) remains an open challenge.
  • Theoretical Guarantees: Formalizing irreversibility and robustness properties under various attack models is a promising direction.
  • Integration with Model Editing: Combining unlearning with targeted model editing could enable fine-grained control over LLM knowledge.

Conclusion

JensUn establishes a new state-of-the-art in LLM unlearning by leveraging the boundedness and stability of the Jensen-Shannon Divergence, achieving superior forget-utility trade-offs and robustness to relearning. The accompanying evaluation framework exposes the limitations of prior benchmarks and provides a rigorous, semantically grounded protocol for future research. This work lays the foundation for trustworthy, utility-preserving unlearning in large-scale LLMs, a prerequisite for their safe and responsible deployment.
