
Logit-Subtraction: A Method for LLM Unlearning

Updated 14 March 2026
  • The paper introduces logit subtraction, a novel paradigm that suppresses sensitive knowledge via algebraic manipulation of pre-softmax logits.
  • The paper leverages auxiliary models to target forgetting without extensive fine-tuning, achieving high forgetting quality (F.Q. ≈ 0.99) while maintaining task utility.
  • The paper demonstrates that contrastive and offset methods enable efficient, modular unlearning, though challenges remain in inference latency and hyperparameter tuning.

Logit-subtraction is a paradigm for LLM unlearning in which sensitive knowledge is suppressed through algebraic manipulation of pre-softmax logits, rather than further fine-tuning billions of parameters. This approach enables targeted forgetting while preserving general utility and minimizing catastrophic forgetting. Unlike traditional fine-tuning–based unlearning, logit-subtraction enables efficient, modular, and often inference-time remediation of privacy, copyright, or safety violations. The core idea is to leverage one or more auxiliary models (“assistants” or “offset” models) whose logits encode knowledge of the forget and/or retain sets, and to remove or neutralize the undesired contributions by subtractive or contrastive combinations at the logit level (Ji et al., 2024, Suriyakumar et al., 12 Jun 2025, Huang et al., 2024).

1. Theoretical Foundation and Motivation

Conventional LLM unlearning is formalized by specifying a "forget" set $\mathcal{D}_f$ and (optionally) a "retain" set $\mathcal{D}_r$, then aiming for an updated model $p(y \mid x; \theta')$ that fails to recall information in $\mathcal{D}_f$ but retains its general capabilities on $\mathcal{D}_r$ and downstream tasks. Most standard techniques optimize an objective of the form

$$\min_{\theta'} \left[ -L_f(\theta') + \beta L_r(\theta') \right]$$

with $L_f$, $L_r$ being cross-entropy losses on $\mathcal{D}_f$ and $\mathcal{D}_r$. This framework is vulnerable to two canonical pathologies:

  • Unbounded forgetting: Maximizing $L_f$ can cause unbounded logit growth, producing degenerate, repetitive, or pathological outputs.
  • Catastrophic generalization loss: The retain set $\mathcal{D}_r$ is typically much smaller than the original pretraining set, so minimizing $L_r$ fails to fully preserve general-purpose capabilities, resulting in severe utility degradation.

Logit-subtraction methods decouple forgetting from global model editing by exploiting the additivity of logits and formulating unlearning as a form of expert composition. By combining the original model and appropriately trained or fine-tuned assistant models at the logit level, these approaches can suppress targeted knowledge without global retraining (Ji et al., 2024, Suriyakumar et al., 12 Jun 2025, Huang et al., 2024).

2. Logit-Subtraction Mechanisms and Formulations

2.1 Unlearning from Logit Difference (ULD)

The ULD framework (Ji et al., 2024) introduces an assistant LLM, $q(y \mid x; \phi)$, trained with a reversed unlearning objective: it remembers $\mathcal{D}_f$ and forgets $\mathcal{D}_r$. The core logit-subtraction formula is:

$$l_{\mathrm{unlearn}}(y \mid x) = l_{\mathrm{target}}(y \mid x) - \alpha \cdot l_{\mathrm{aux}}(y \mid x)$$

where $l_{\mathrm{target}}$ and $l_{\mathrm{aux}}$ are the pre-softmax logits of the original (target) and assistant models, respectively, and $\alpha > 0$ controls suppression strength. The assistant is optimized by:

$$\min_{\phi} \left[ L_f(\phi) - \beta L_r(\phi) \right]$$

where $L_f$ is standard cross-entropy on the forget set, and $L_r$ is the output entropy on the retain set, which the minus sign drives toward its maximum (uniform output).

After training, inference proceeds by subtracting the scaled assistant logits from the original model's logits and applying softmax. This depresses token probabilities for forget-set completions while leaving retain-set completions essentially unchanged: a uniform assistant distribution shifts all logits equally, which softmax cancels.
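As a concrete illustration, the subtraction step can be sketched in a few lines of NumPy. The 4-token vocabulary and the `uld_step` helper are hypothetical toy constructions, not taken from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def uld_step(l_target, l_aux, alpha=1.0):
    """ULD-style logit subtraction: softmax(l_target - alpha * l_aux)."""
    return softmax(l_target - alpha * l_aux)

# Toy 4-token vocabulary; token 2 stands in for a "forget" completion.
l_target = np.array([1.0, 0.5, 3.0, 0.2])  # target model favours token 2
l_aux    = np.array([0.0, 0.0, 4.0, 0.0])  # assistant remembers the forget set

p_before = softmax(l_target)
p_after  = uld_step(l_target, l_aux, alpha=1.0)
# after subtraction, the forget token's probability drops sharply
```

Here the assistant's logits are near-uniform except on the forget token, so the subtraction suppresses that token while barely perturbing the rest of the distribution.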

2.2 Contrastive Decoding for Inference-Time Unlearning (UCD)

UCD (Suriyakumar et al., 12 Jun 2025) generalizes logit subtraction to inference time using two auxiliary models, $M_f$ (forget model) and $M_r$ (retain model), fine-tuned on $\mathcal{D}_f$ and $\mathcal{D}_r$, respectively. At each timestep $t$, the unlearning-adjusted logits are

$$\ell_u(v \mid x_{<t}) = \ell_0(v \mid x_{<t}) - \alpha \big( \ell_f(v \mid x_{<t}) - \ell_r(v \mid x_{<t}) \big)$$

where $\ell_0(\cdot)$ is the original model's logit and $\alpha$ is a tunable suppression hyperparameter. Greedy, top-$p$, or temperature sampling can then be applied to $\mathrm{softmax}(\ell_u)$. This approach requires no model updates and can be layered over any existing model checkpoint.

A variant termed “contrastive suppression” clips negative differences, i.e., only reduces the probability of sensitive tokens:

$$\ell_u(v \mid x_{<t}) = \ell_0(v \mid x_{<t}) - \alpha \max\big( \ell_f(v \mid x_{<t}) - \ell_r(v \mid x_{<t}),\, 0 \big)$$
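Both UCD variants amount to one vector operation per decoding step. A minimal NumPy sketch with hypothetical toy logits (the `ucd_logits` helper and the 3-token example are illustrative, not from the paper):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ucd_logits(l0, lf, lr, alpha=1.0, clip=False):
    """UCD adjustment l_u = l0 - alpha * (lf - lr). With clip=True this is
    "contrastive suppression": negative differences are clipped to zero, so
    only tokens the forget model prefers over the retain model are penalised."""
    diff = lf - lr
    if clip:
        diff = np.maximum(diff, 0.0)
    return l0 - alpha * diff

# Toy example: token 1 plays the role of a sensitive continuation.
l0 = np.array([2.0, 2.0, 1.0])   # base model
lf = np.array([0.0, 3.0, 0.5])   # forget model prefers token 1
lr = np.array([0.5, 0.0, 1.0])   # retain model does not

p_plain = softmax(ucd_logits(l0, lf, lr, alpha=1.0))
p_clip  = softmax(ucd_logits(l0, lf, lr, alpha=1.0, clip=True))
```

The clipped variant leaves the non-sensitive logits exactly equal to the base model's, whereas the plain difference can also boost tokens the retain model prefers.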

2.3 Offset-Unlearning for Black-Box LLMs

$\delta$-Unlearning (Huang et al., 2024) addresses black-box scenarios by learning a logit offset from a pair of smaller white-box models $M_o$ and $M_o'$, both initialized identically. $M_o'$ is fine-tuned for unlearning, yielding

$$\delta(x) := l_{M_o'}(x) - l_{M_o}(x)$$

The ensemble logit for the black-box LLM $M$ is then:

$$l_e(x) = l_M(x) + \alpha \cdot \delta(x)$$

at each generation step (with $\alpha$ typically set to $1$). This construction allows modular, privacy-preserving, plug-and-play unlearning, with the ensemble probabilities interpreted as a product of experts:

$$P_e(y_t \mid q, y_{<t}) \propto P_M(y_t \mid q, y_{<t}) \times \left( \frac{P_{M_o'}(y_t \mid q, y_{<t})}{P_{M_o}(y_t \mid q, y_{<t})} \right)^{\alpha}$$
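The additive-logit and product-of-experts views are algebraically equivalent, since adding $\delta$ before the softmax multiplies the probabilities by $e^{\alpha\delta}$ up to normalization. A short NumPy check with hypothetical per-step logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for the black-box model M and the two small
# offset models M_o (pre-unlearning) and M_o' (post-unlearning).
l_M        = np.array([3.0, 1.0, 0.5])
l_Mo       = np.array([0.2, 2.0, 0.1])
l_Mo_prime = np.array([0.2, -1.0, 0.1])
alpha = 1.0

delta = l_Mo_prime - l_Mo                     # delta(x) = l_{M_o'} - l_{M_o}
p_ensemble = softmax(l_M + alpha * delta)     # additive-logit form

# Product-of-experts form: P_e ∝ P_M * (P_{M_o'} / P_{M_o})^alpha
poe = softmax(l_M) * (softmax(l_Mo_prime) / softmax(l_Mo)) ** alpha
poe = poe / poe.sum()
```

The differing softmax normalizers cancel when `poe` is renormalized, so the two vectors coincide; this is why the ensemble can be described interchangeably in either form.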

3. Algorithmic Procedures and Implementation

ULD (Ji et al., 2024) proceeds as follows:

  1. Initialize Assistant Model: Build an assistant by reusing the first $K$ layers and output head of the target LLM. Freeze all parameters except LoRA adapters.
  2. Data Augmentation: Augment forget examples via paraphrase and retain examples via perturbation, yielding $\mathcal{D}_f'$ and $\mathcal{D}_r'$.
  3. Train Assistant: Alternate batches from $\mathcal{D}_f'$ (minimize cross-entropy) and $\mathcal{D}_r'$ (maximize entropy), updating only the adapter parameters.
  4. Unlearning via Subtraction: At inference, compute $l_{\mathrm{unlearn}} = l_{\mathrm{target}} - \alpha l_{\mathrm{aux}}$ and apply $\mathrm{softmax}$ to generate output.
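The assistant's training signal in step 3 combines the two terms of $L_f(\phi) - \beta L_r(\phi)$. A forward-only NumPy sketch of the objective value (no autograd; `assistant_objective` and the toy batches are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def assistant_objective(logits_f, targets_f, logits_r, beta=1.0):
    """Value of L_f(phi) - beta * L_r(phi): cross-entropy on forget tokens
    minus beta times the mean output entropy on retain inputs, which the
    minimisation drives toward uniform predictions on the retain set."""
    p_f = softmax(logits_f)
    ce = -np.mean(np.log(p_f[np.arange(len(targets_f)), targets_f]))
    p_r = softmax(logits_r)
    entropy = -np.mean(np.sum(p_r * np.log(p_r + 1e-12), axis=-1))
    return ce - beta * entropy

# Toy batch: two forget positions (gold tokens 1 and 0), two retain positions.
lf = np.array([[0.1, 2.0, 0.0], [1.5, 0.0, 0.3]])
lr_uniform = np.zeros((2, 3))                       # maximal-entropy retain output
lr_peaked  = np.array([[3.0, 0.0, 0.0]] * 2)        # confident retain output
targets = np.array([1, 0])

loss_uniform = assistant_objective(lf, targets, lr_uniform, beta=0.5)
loss_peaked  = assistant_objective(lf, targets, lr_peaked,  beta=0.5)
```

An assistant that is uniform on retain inputs attains a lower objective than one that is confident there, which is exactly the behaviour the subtraction step relies on.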

For UCD (Suriyakumar et al., 12 Jun 2025), decoding proceeds at each step as follows:

  • Compute logits for the base model ($\ell_0$), forget model ($\ell_f$), and retain model ($\ell_r$).
  • Compute the contrastive difference $\Delta\ell = \ell_f - \ell_r$.
  • Subtract $\alpha \Delta\ell$ from $\ell_0$ to obtain $\ell_u$.
  • Apply temperature and sample from $\mathrm{softmax}(\ell_u)$.
$\delta$-Unlearning (Huang et al., 2024) proceeds as follows:

  1. Precompute Logits: Query the black-box LLM and both offset models on all unique $(q, y_{<t})$ pairs needed for training.
  2. Fine-Tune Offset Model: Update $M_o'$ using GRADASCENT, GRADDIF, or KL-MIN objectives, always computing the loss on the ensemble's output, not on $M_o'$ alone.
  3. Inference: For any query, combine black-box and offset-model logits as $l_e(x)$ and sample from $\mathrm{softmax}(l_e)$.
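The key point in step 2 is that the loss is evaluated on the ensemble distribution while the black-box logits stay fixed. A forward-only NumPy sketch of a gradient-ascent-style forget objective (the helper and toy logits are hypothetical):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ensemble_forget_loss(l_blackbox, l_off_base, l_off_tuned, target, alpha=1.0):
    """Negative log-likelihood of the forget token under the *ensemble*
    distribution. A GRADASCENT-style update would raise this value by
    changing only l_off_tuned; l_blackbox is a fixed input (logits access only)."""
    l_e = l_blackbox + alpha * (l_off_tuned - l_off_base)
    return -np.log(softmax(l_e)[target])

l_bb = np.array([2.0, 1.0, 0.0])    # black-box logits (fixed)
l_ob = np.array([0.5, 0.5, 0.5])    # offset model before tuning
l_ot = np.array([-1.5, 0.5, 0.5])   # offset model after pushing the forget token down

before = ensemble_forget_loss(l_bb, l_ob, l_ob, target=0)  # delta = 0 initially
after  = ensemble_forget_loss(l_bb, l_ob, l_ot, target=0)
# tuning the offset raises the forget token's NLL: after > before
```

Because identical initialization makes $\delta(x) = 0$ before tuning, the ensemble starts exactly at the black-box model's distribution and only departs from it where the offset pair disagrees.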

4. Empirical Validation and Comparative Results

4.1 Forget/Utility Trade-off

  • ULD (Ji et al., 2024): On the TOFU benchmark (1% forget task), achieves forgetting quality (F.Q.) $\approx 0.99$ (vs. $0.40$ baseline) with a $0\%$ drop in retained-task utility (baselines lose up to $17\%$ utility). On Harry Potter copyright text, BLEU/R-L metrics on forgotten text drop to near zero while non-forget perplexity holds at $\approx 9.95$ (baselines: $15$–$50$).
  • UCD (Suriyakumar et al., 12 Jun 2025): On Llama2-13B (TOFU 5%), full indistinguishability from retraining is achieved, with utility improved from $45\%$ (retrain) to $52\%$. On Llama2-70B, UCD gives utility $\approx 62\%$ (vs. $45\%$ for retraining) while holding forget quality at baseline.
  • $\delta$-Unlearning (Huang et al., 2024): On TOFU, matches retraining ($\mathrm{RL}_{\mathrm{Forget}} \approx 39$) and improves the paraphrased truth ratio (TR from $53 \rightarrow 58$) compared to direct fine-tuning. World-fact accuracy is preserved (RL/P/TR superior to direct FT in $11$ of $12$ cases).

4.2 Efficiency and Scalability

  • Training Cost: ULD reduces per-epoch training time by over $3\times$ (using only $\sim$20M LoRA parameters).
  • Inference Overhead: All logit-subtraction methods require at least one additional forward pass per input (two for ULD/UCD, two or three for δ\delta-Unlearning), but the computation is parallelizable. The assistant/offset models are typically much smaller than the main LLM.
  • Size Scalability: UCD is demonstrated at scale on Llama2-70B, and δ\delta-Unlearning is compatible with completely black-box APIs.

4.3 Summary of Method Differences

| Method | Extra Models | Training Scope | Black-box Support | Inference-Only? |
| --- | --- | --- | --- | --- |
| ULD (Ji et al., 2024) | 1 (assistant) | LoRA adapters, parameter-efficient | No | No (but efficient deployment) |
| UCD (Suriyakumar et al., 12 Jun 2025) | 2 (forget/retain) | Auxiliary models | Yes | Yes |
| $\delta$-Unlearning (Huang et al., 2024) | 2 (offset) | Offset models, LoRA | Yes | Yes |

5. Limitations and Extensions

  • Inference Latency: Logit-subtraction methods introduce additional forward passes, increasing decoding latency, but the cost is minor for parameter-efficient offsets or when amortized over batched queries.
  • Auxiliary Model Quality: Success depends on the auxiliary models faithfully encoding the forget/retain knowledge. Suboptimal fine-tuning or model misalignment can degrade both forgetting and utility.
  • Retain Set Design: All methods are vulnerable to mis-specification or insufficient coverage of $\mathcal{D}_r$, especially if the non-forget generalization set is large or diverse.
  • Hyperparameter Sensitivity: The suppression weight $\alpha$ and entropy targets must be tuned; aggressive suppression degrades utility, while weak suppression fails to erase target information (Suriyakumar et al., 12 Jun 2025).
  • Generality: Logit subtraction has been proposed as a general expert-composition paradigm, potentially applicable to sentiment steering, bias removal, and post-hoc factuality correction (Ji et al., 2024).
  • Adaptation to Black-box and Large-scale APIs: $\delta$-Unlearning (Huang et al., 2024) demonstrates that logit-based unlearning is viable with logits access alone, without gradient or parameter access, increasing its practicality for commercial LLM deployments.

6. Historical Context: Linear Filtration

The concept of manipulating model logits for targeted unlearning originates in earlier work on linear filtration for logit-based classifiers (Baumhauer et al., 2020). There, a linear transformation is applied to the final logits to sanitize models post hoc, especially for class-wise deletion in image or tabular classifiers. The method can in principle be transferred to LLMs by treating each vocabulary token as a removable class and constructing the appropriate filtration matrix, but direct adaptation is computationally intensive for modern vocabulary sizes.
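To make the classifier-era idea concrete, here is a minimal sketch of a "naive" filtration: a linear map over class probabilities that zeroes out a deleted class, after which renormalization redistributes its mass. This is a simplified illustration, not the exact filtration operators of Baumhauer et al. (2020):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def naive_filtration_matrix(n_classes, deleted):
    """Linear map A over class probabilities that zeroes the deleted class.
    A simplified sketch of post-hoc sanitization; the published filtration
    operators are more refined."""
    A = np.eye(n_classes)
    A[deleted, deleted] = 0.0
    return A

p = softmax(np.array([1.0, 2.0, 0.5]))   # toy 3-class classifier output
A = naive_filtration_matrix(3, deleted=1)
q = A @ p
q = q / q.sum()                          # class 1 now has exactly zero mass
```

The LLM-scale obstacle mentioned above is visible here: with a vocabulary of ~100k tokens, the analogous matrix is 100k × 100k, which motivates the model-based subtraction schemes instead.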

Logit subtraction generalizes these ideas by leveraging parameter-efficient adaptation (LoRA), auxiliary expert models, and contrastive decoding composition. Compared to in-context prompt unlearning or direct model fine-tuning, logit-subtraction achieves a superior balance of efficacy, generality, and privacy compliance, especially where access to full training sets or gradients is unavailable.

7. Practical Guidelines and Extensions

  • Model Selection: Auxiliary models should be much smaller than the primary LLM for practical fine-tuning. Empirical results indicate 7B auxiliaries suffice for 13B and 70B base models (Suriyakumar et al., 12 Jun 2025, Huang et al., 2024).
  • Loss/Objective Choice: Logit subtraction supports integration with any unlearning objective, including gradient ascent, gradient difference, and KL minimization (Huang et al., 2024).
  • Parameter Tuning: For most benchmarks, $\alpha \approx 0.5$–$1.0$ yields the best trade-offs; too small under-forgets, too large degrades utility (Suriyakumar et al., 12 Jun 2025).
  • Data Augmentation: Effective paraphrase and perturbation strategies for $\mathcal{D}_f'$ and $\mathcal{D}_r'$ are critical and may benefit from automation.
  • Inference Policy: Deterministic and stochastic decoding strategies are both compatible; key forget/utility metrics remain invariant under deterministic strategies (Suriyakumar et al., 12 Jun 2025).
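The $\alpha$ trade-off in the tuning guideline above can be illustrated numerically: as $\alpha$ grows, the probability of the sensitive token falls monotonically, but so (eventually) does fidelity to the base distribution. The toy logits below are hypothetical:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy per-token logits: token 0 plays the role of the sensitive continuation.
l0   = np.array([3.0, 2.5, 1.0])    # base model
diff = np.array([4.0, -0.5, 0.0])   # forget-minus-retain contrast

# Sweep the suppression weight and track the sensitive token's probability.
alphas = (0.0, 0.5, 1.0, 2.0)
probs = [softmax(l0 - a * diff)[0] for a in alphas]
for a, p in zip(alphas, probs):
    print(f"alpha={a}: P(sensitive)={p:.3f}")
```

In practice the sweep would be scored against both a forget metric and a utility metric, picking the smallest $\alpha$ that reaches the forgetting target.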

In sum, logit-subtraction has established itself as a powerful, theoretically principled, and empirically validated framework for efficient, privacy-compliant, and utility-preserving LLM unlearning, with direct extensions to broader model editing and steering applications (Ji et al., 2024, Suriyakumar et al., 12 Jun 2025, Huang et al., 2024, Baumhauer et al., 2020).
