JensUn: LLM Unlearning Methodology
- JensUn is a machine unlearning method that precisely removes targeted factual knowledge from large language models using a JSD-based loss function.
- It balances forgetting and retention via loss-weighting hyperparameters, achieving complete erasure of designated facts while preserving general utility on retained data.
- The approach is validated through rigorous benchmarks such as the LKF dataset and worst-case paraphrase testing, demonstrating exceptional stability and resistance to benign relearning.
JensUn is a machine unlearning methodology designed for LLMs, enabling the targeted removal (“forgetting”) of specific factual knowledge from a pre-trained model while preserving its general utility across diverse tasks. JensUn employs the Jensen-Shannon Divergence (JSD) as its central loss function, both for driving effective forgetting and ensuring retention of desired capabilities. The approach has been validated through rigorous evaluation on challenging benchmarks, including the newly introduced LKF (“Lesser Known Facts”) dataset, and via an advanced evaluation protocol using worst-case paraphrase testing and LLM-based semantic judgments.
1. Methodological Foundation: JSD-based Unlearning Objective
JensUn modifies LLM parameters to eliminate response proficiency on a designated set of “facts to be forgotten” (the forget set $\mathcal{D}_f$), while preventing degradation on a retain set $\mathcal{D}_r$ representing generic knowledge and capabilities.
Instead of conventional objectives based on log-likelihood or unbounded KL divergence, JensUn formalizes its loss as a weighted sum of two JSD terms:
- Forget Loss:
$$\mathcal{L}_{\text{forget}} = \mathbb{E}_{x \sim \mathcal{D}_f}\left[\mathrm{JSD}\!\left(p_\theta(\cdot \mid x)\,\|\,\mathbf{e}_{\tilde{y}}\right)\right]$$
Here, the model’s output distribution $p_\theta$ is aligned with the one-hot distribution $\mathbf{e}_{\tilde{y}}$ over the tokens of a prescribed refusal or neutral answer $\tilde{y}$ (e.g., “No idea”).
- Retain Loss:
$$\mathcal{L}_{\text{retain}} = \mathbb{E}_{x \sim \mathcal{D}_r}\left[\mathrm{JSD}\!\left(p_\theta(\cdot \mid x)\,\|\,p_{\theta_0}(\cdot \mid x)\right)\right]$$
This maintains similarity to the reference (pre-trained) model $p_{\theta_0}$ on the retain set.
- Total Objective:
$$\mathcal{L} = \alpha\,\mathcal{L}_{\text{forget}} + \beta\,\mathcal{L}_{\text{retain}}$$
The use of JSD is critical for its symmetry and boundedness ($0 \le \mathrm{JSD} \le \log 2$), attenuating gradient instability and enabling controlled long-duration fine-tuning. This results in highly stable unlearning dynamics relative to KL- or log-likelihood-based methods.
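To make the objective concrete, the following is a minimal PyTorch sketch of the two JSD terms; the helper names, the `alpha`/`beta` weighting arguments, and the numerical-stability details are illustrative assumptions rather than the authors' reference implementation.

```python
# Minimal sketch of the JSD-based unlearning objective (names illustrative).
import torch
import torch.nn.functional as F

def jsd(p_logits: torch.Tensor, q_probs: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between the model's token distribution
    (given as logits) and a target distribution; bounded by log 2."""
    eps = 1e-12
    p = F.softmax(p_logits, dim=-1)
    m = 0.5 * (p + q_probs)
    kl_pm = (p * ((p + eps).log() - (m + eps).log())).sum(-1)
    kl_qm = (q_probs * ((q_probs + eps).log() - (m + eps).log())).sum(-1)
    return 0.5 * (kl_pm + kl_qm).mean()

def unlearning_loss(logits_forget, refusal_ids, logits_retain, ref_logits_retain,
                    alpha=1.0, beta=1.0):
    """alpha * L_forget + beta * L_retain, as in the total objective above."""
    vocab = logits_forget.size(-1)
    # Forget term: align the model with the one-hot tokens of the fixed
    # refusal answer (e.g., "No idea") on forget-set queries.
    onehot = F.one_hot(refusal_ids, num_classes=vocab).float()
    loss_forget = jsd(logits_forget, onehot)
    # Retain term: match the frozen reference model's distribution.
    ref_probs = F.softmax(ref_logits_retain, dim=-1).detach()
    loss_retain = jsd(logits_retain, ref_probs)
    return alpha * loss_forget + beta * loss_retain
```

Because the JSD is bounded by $\log 2$, both terms contribute bounded gradients, which is what permits the long, stable fine-tuning runs described above.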
2. Forget-Utility Trade-off and Pareto Efficiency
A principal goal of machine unlearning is to achieve maximal forgetting of designated information with minimal side effects on model utility. JensUn is constructed to optimize this trade-off via the hyperparameters $\alpha$ and $\beta$, which balance the contributions of the forget and retain terms in the loss.
Empirical evidence:
On the LKF benchmark, JensUn achieves a worst-case forget-set accuracy ($A_{\text{wc}}$) of $0\%$, indicating complete factual erasure, while maintaining an MMLU score of $59.9$ and a win rate (WR) of approximately $0.47$ (see Table 1 in the referenced paper). Performance on the retain set and generic tasks remains nearly indistinguishable from the base model.
Context:
Trade-off curves presented in the original work place JensUn on or near the Pareto front, outperforming alternative methods such as GradAscent, GradDiff, NPO, RMU, and SimNPO in both forgetting and retention metrics. This suggests that the use of JSD is effective for simultaneously optimizing both objectives.
3. Robustness to Benign Relearning
Robustness against accidental “relearning” of erased content is a hallmark of persistent unlearning. JensUn demonstrates strong persistence against the reacquisition of forgotten facts, even after subsequent fine-tuning (relearning) on disjoint new data:
- Experimental findings:
- The win rate remains high, indicating general utility is preserved.
- The model resists inadvertent memory recovery due to the bounded, stable gradients imposed by JSD.
This behavior implies that the erasure is more irreversible than with competing techniques; however, the precise irreversibility depends on continued isolation from the forgotten data.
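As a sketch of this relearning probe under stated assumptions: `finetune` stands in for any standard fine-tuning loop, and `evaluate_awc` for the $A_{\text{wc}}$ evaluation described in Section 5; both are hypothetical callables, not the authors' published harness.

```python
# Hedged sketch of a benign-relearning probe; the injected callables
# `finetune` and `evaluate_awc` are hypothetical stand-ins.
def relearning_probe(unlearned_model, benign_data, forget_set,
                     finetune, evaluate_awc, steps=1000):
    """Fine-tune on data disjoint from the forget set, then re-measure
    whether the erased facts resurface."""
    acc_before = evaluate_awc(unlearned_model, forget_set)
    relearned = finetune(unlearned_model, benign_data, steps=steps)
    acc_after = evaluate_awc(relearned, forget_set)
    # Robust unlearning: acc_after stays near acc_before (ideally ~0).
    return acc_before, acc_after
```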
4. LKF Dataset: Design and Evaluation Rigor
To facilitate objective testing of unlearning mechanisms, the LKF dataset consists of:
| Subset | Q–A Pair Count | Topics Included |
|---|---|---|
| Forget set | 100 | Challenger Disaster, Salem Witch Trials, Cod Wars, Krakatoa Eruption, Battle of Talas |
| Retain set | 400 | Same topics as above; distinct Q–A pairs from the forget set |
- Properties:
- Topics are niche, reducing the probability of correct guesses.
- Avoids dichotomous (e.g., yes/no) formats, which would let a model score well by guessing, for higher evaluation stringency.
- Retain set is curated to avoid overlap and leakage from the forget set.
This design enhances the realism of privacy-motivated unlearning and increases confidence that measured forgetting is substantive rather than superficial.
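For illustration, an LKF-style record could be represented as below; the field names and the example pair are assumptions for exposition, not the published dataset schema.

```python
# Illustrative LKF-style record (schema and example are assumptions).
from dataclasses import dataclass

@dataclass
class QAPair:
    topic: str     # one of the five niche topics
    question: str  # open-ended, never yes/no (avoids dichotomous formats)
    answer: str    # short ground-truth answer
    split: str     # "forget" (100 pairs) or "retain" (400 pairs)

example = QAPair(
    topic="Cod Wars",
    question="Which two countries were the main parties to the Cod Wars?",
    answer="Iceland and the United Kingdom",
    split="forget",
)
```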
5. Evaluation Framework: Semantic and Worst-case Accuracy
JensUn introduces two core innovations for rigorous assessment of unlearning performance:
- LLM-based Semantic Judge:
Automatic semantic evaluation is performed using a capable LLM (Gemini-2.5-Flash). Rather than relying on ROUGE scores, which are sensitive to wording, the semantic judge determines factual correctness by answering “YES” or “NO” to whether the candidate response matches ground truth. ROC curve analysis confirms close alignment with human evaluators.
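A minimal sketch of such a judge call follows; the prompt wording and the `generate` callable are assumptions standing in for an actual Gemini-2.5-Flash client, not the authors' exact prompt.

```python
# Sketch of the LLM-as-judge check (prompt wording is an assumption).
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Ground-truth answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Does the candidate state the same fact as the ground truth? "
    "Reply with exactly YES or NO."
)

def semantic_judge(generate, question: str, reference: str, candidate: str) -> bool:
    """`generate` is any callable mapping a prompt string to a completion,
    e.g. a thin wrapper around a Gemini-2.5-Flash API call."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    return generate(prompt).strip().upper().startswith("YES")
```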
- Worst-case Accuracy via Paraphrases and In-context Retain Samples:
Each forget-query is paraphrased into semantically equivalent forms, and queries are paired with in-context retain examples (ICR). The worst-case metric $A_{\text{wc}}$ counts a fact as retained if any paraphrase or context variant still elicits a correct answer (a minimal implementation sketch follows this list):
$$A_{\text{wc}} = \frac{1}{|\mathcal{D}_f|} \sum_{(x,\,y) \in \mathcal{D}_f} \max_{x' \in \mathcal{P}(x)} \mathbb{1}\!\left[\text{Judge}\big(f_\theta(x'),\, y\big) = \text{YES}\right]$$
where $\mathcal{P}(x)$ is the set of paraphrases and ICR-augmented variants of query $x$, and $f_\theta(x')$ is the model's answer.
- This protocol is strictly harder than evaluating a single prompt and exposes subtle residual knowledge.
- Prior methods relying on string similarity or single-format testing are shown to overstate unlearning quality.
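Below is a minimal sketch of this computation; `model_answer` and `judge` are hypothetical callables (for instance, the semantic judge above with its `generate` argument bound via `functools.partial`).

```python
# Sketch of worst-case accuracy A_wc (callable names are assumptions).
def worst_case_accuracy(model_answer, judge, forget_set) -> float:
    """forget_set: iterable of (variants, reference) pairs, where `variants`
    holds the original query, its paraphrases, and ICR-augmented versions;
    judge(question, reference, candidate) -> bool."""
    hits = 0
    for variants, reference in forget_set:
        # Worst case: the fact counts as retained if ANY variant succeeds.
        if any(judge(v, reference, model_answer(v)) for v in variants):
            hits += 1
    return hits / len(forget_set)  # 0.0 => complete erasure under A_wc
```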
A plausible implication is that worst-case metrics may become standard for high-assurance unlearning evaluation.
6. Comparative Benchmarking
JensUn is systematically compared with established baselines—GradAscent, GradDiff, DPO, NPO, RMU, and SimNPO—across LKF and RWKU benchmarks. The enhanced evaluation methodology surfaces weaknesses in prior metrics, such as overestimation of forgetting due to ROUGE artifacts or lack of paraphrase robustness.
- Results:
- On LKF, JensUn achieves $A_{\text{wc}} = 0\%$ while matching pre-trained utility scores.
- On RWKU, JensUn attains the lowest forget-set accuracy across query formats (factual and QA).
- JensUn maintains retention capabilities and resistance to benign relearning more effectively than all comparators.
This systematic superiority is attributed to the combination of JSD objective and rigorous evaluation.
7. Practical Implications and Future Directions
JensUn defines the state-of-the-art in machine unlearning for LLMs with its combination of:
- Algorithmic stability: via bounded loss and gradient scaling,
- Granular control: over the forget-utility trade-off,
- Persistence: of erasure against post-unlearning updates,
- Enhanced evaluation: integrating semantic LLM judgments and worst-case testing.
This suggests that future unlearning research should prioritize both robust, bounded objectives and stringent, semantically aware evaluation pipelines. The introduction of datasets such as LKF and of protocols grounded in worst-case accuracy is likely to influence methodological norms in privacy-preserving machine learning.