
Logit-Subtraction: A Method for LLM Unlearning

Updated 14 March 2026
  • The paper introduces logit subtraction, a novel paradigm that suppresses sensitive knowledge via algebraic manipulation of pre-softmax logits.
  • The paper leverages auxiliary models to target forgetting without extensive fine-tuning, achieving high forgetting quality (F.Q. ≈ 0.99) while maintaining task utility.
  • The paper demonstrates that contrastive and offset methods enable efficient, modular unlearning, though challenges remain in inference latency and hyperparameter tuning.

Logit-subtraction is a paradigm for LLM unlearning in which sensitive knowledge is suppressed through algebraic manipulation of pre-softmax logits, rather than further fine-tuning billions of parameters. This approach enables targeted forgetting while preserving general utility and minimizing catastrophic forgetting. Unlike traditional fine-tuning–based unlearning, logit-subtraction enables efficient, modular, and often inference-time remediation of privacy, copyright, or safety violations. The core idea is to leverage one or more auxiliary models (“assistants” or “offset” models) whose logits encode knowledge of the forget and/or retain sets, and to remove or neutralize the undesired contributions by subtractive or contrastive combinations at the logit level (Ji et al., 2024, Suriyakumar et al., 12 Jun 2025, Huang et al., 2024).

1. Theoretical Foundation and Motivation

Conventional LLM unlearning is formalized by specifying a "forget" set $\mathcal{D}_f$ and (optionally) a "retain" set $\mathcal{D}_r$, then aiming for an updated model $p(y \mid x; \theta')$ that fails to recall information in $\mathcal{D}_f$ but retains its general capabilities on $\mathcal{D}_r$ and downstream tasks. Most standard techniques optimize an objective of the form

$$\min_{\theta'} \left[ -L_f(\theta') + \beta L_r(\theta') \right]$$

with $L_f$, $L_r$ being cross-entropy losses on $\mathcal{D}_f$ and $\mathcal{D}_r$. This framework is vulnerable to two canonical pathologies:

  • Unbounded forgetting: Maximizing $L_f$ can cause unbounded logit growth, producing degenerate, repetitive, or pathological outputs.
  • Catastrophic generalization loss: The retain set $\mathcal{D}_r$ is typically much smaller than the original pretraining set, so minimizing $L_r$ fails to fully preserve general-purpose capabilities, resulting in severe utility degradation.

Logit-subtraction methods decouple forgetting from global model editing by exploiting the additivity of logits and formulating unlearning as a form of expert composition. By combining the original model and appropriately trained or fine-tuned assistant models at the logit level, these approaches can suppress targeted knowledge without global retraining (Ji et al., 2024, Suriyakumar et al., 12 Jun 2025, Huang et al., 2024).

2. Logit-Subtraction Mechanisms and Formulations

2.1 Unlearning from Logit Difference (ULD)

The ULD framework (Ji et al., 2024) introduces an assistant LLM, $q(y \mid x; \phi)$, trained with a reversed unlearning objective: it remembers $\mathcal{D}_f$ and forgets $\mathcal{D}_r$. The core logit-subtraction formula is:

$$l_{\mathrm{unlearn}}(y \mid x) = l_{\mathrm{target}}(y \mid x) - \alpha \cdot l_{\mathrm{aux}}(y \mid x)$$

where $l_{\mathrm{target}}$ and $l_{\mathrm{aux}}$ are the pre-softmax logits of the original (target) and assistant models, respectively, and $\alpha > 0$ controls suppression strength. The assistant is optimized by:

$$\min_{\phi} \left[ L_f(\phi) - \beta L_r(\phi) \right]$$

where $L_f$ is standard cross-entropy on the forget set, and $L_r$ is the output entropy on the retain set, which the minus sign drives toward its maximum (uniform output).

After training, inference proceeds by subtracting the scaled assistant logits from the original model's logits and applying softmax. This depresses token probabilities for forget-set completions while leaving retain-set completions essentially unchanged: a uniform assistant distribution shifts all logits equally, which softmax cancels.
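As a concrete illustration, the subtraction step can be sketched in a few lines of NumPy. The 4-token vocabulary and the `uld_step` helper are hypothetical toy constructions, not taken from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def uld_step(l_target, l_aux, alpha=1.0):
    """ULD-style logit subtraction: softmax(l_target - alpha * l_aux)."""
    return softmax(l_target - alpha * l_aux)

# Toy 4-token vocabulary; token 2 stands in for a "forget" completion.
l_target = np.array([1.0, 0.5, 3.0, 0.2])  # target model favours token 2
l_aux    = np.array([0.0, 0.0, 4.0, 0.0])  # assistant remembers the forget set

p_before = softmax(l_target)
p_after  = uld_step(l_target, l_aux, alpha=1.0)
# after subtraction, the forget token's probability drops sharply
```

Here the assistant's logits are near-uniform except on the forget token, so the subtraction suppresses that token while barely perturbing the rest of the distribution.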

2.2 Contrastive Decoding for Inference-Time Unlearning (UCD)

UCD (Suriyakumar et al., 12 Jun 2025) generalizes logit subtraction to inference time using two auxiliary models, $M_f$ (forget model) and $M_r$ (retain model), fine-tuned on $\mathcal{D}_f$ and $\mathcal{D}_r$, respectively. At each timestep $t$, the unlearning-adjusted logits are

$$\ell_u(v \mid x_{<t}) = \ell_0(v \mid x_{<t}) - \alpha \big( \ell_f(v \mid x_{<t}) - \ell_r(v \mid x_{<t}) \big)$$

where $\ell_0(\cdot)$ is the original model's logit and $\alpha$ is a tunable suppression hyperparameter. Greedy, top-$p$, or temperature sampling can then be applied to $\mathrm{softmax}(\ell_u)$. This approach requires no model updates and can be layered over any existing model checkpoint.

A variant termed “contrastive suppression” clips negative differences, i.e., only reduces the probability of sensitive tokens:

$$\ell_u(v \mid x_{<t}) = \ell_0(v \mid x_{<t}) - \alpha \max\big( \ell_f(v \mid x_{<t}) - \ell_r(v \mid x_{<t}),\, 0 \big)$$
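Both UCD variants amount to one vector operation per decoding step. A minimal NumPy sketch with hypothetical toy logits (the `ucd_logits` helper and the 3-token example are illustrative, not from the paper):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ucd_logits(l0, lf, lr, alpha=1.0, clip=False):
    """UCD adjustment l_u = l0 - alpha * (lf - lr). With clip=True this is
    "contrastive suppression": negative differences are clipped to zero, so
    only tokens the forget model prefers over the retain model are penalised."""
    diff = lf - lr
    if clip:
        diff = np.maximum(diff, 0.0)
    return l0 - alpha * diff

# Toy example: token 1 plays the role of a sensitive continuation.
l0 = np.array([2.0, 2.0, 1.0])   # base model
lf = np.array([0.0, 3.0, 0.5])   # forget model prefers token 1
lr = np.array([0.5, 0.0, 1.0])   # retain model does not

p_plain = softmax(ucd_logits(l0, lf, lr, alpha=1.0))
p_clip  = softmax(ucd_logits(l0, lf, lr, alpha=1.0, clip=True))
```

The clipped variant leaves the non-sensitive logits exactly equal to the base model's, whereas the plain difference can also boost tokens the retain model prefers.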

2.3 Offset-Unlearning for Black-Box LLMs

$\delta$-Unlearning (Huang et al., 2024) addresses black-box scenarios by learning a logit offset from a pair of smaller white-box models $M_o$ and $M_o'$, both initialized identically. $M_o'$ is fine-tuned for unlearning, yielding

$$\delta(x) := l_{M_o'}(x) - l_{M_o}(x)$$

The ensemble logit for the black-box LLM $M$ is then:

$$l_e(x) = l_M(x) + \alpha \cdot \delta(x)$$

at each generation step (with $\alpha$ typically set to $1$). This construction allows modular, privacy-preserving, plug-and-play unlearning, with the ensemble probabilities interpreted as a product of experts:

$$P_e(y_t \mid q, y_{<t}) \propto P_M(y_t \mid q, y_{<t}) \times \left( \frac{P_{M_o'}(y_t \mid q, y_{<t})}{P_{M_o}(y_t \mid q, y_{<t})} \right)^{\alpha}$$
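The additive-logit and product-of-experts views are algebraically equivalent, since adding $\delta$ before the softmax multiplies the probabilities by $e^{\alpha\delta}$ up to normalization. A short NumPy check with hypothetical per-step logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for the black-box model M and the two small
# offset models M_o (pre-unlearning) and M_o' (post-unlearning).
l_M        = np.array([3.0, 1.0, 0.5])
l_Mo       = np.array([0.2, 2.0, 0.1])
l_Mo_prime = np.array([0.2, -1.0, 0.1])
alpha = 1.0

delta = l_Mo_prime - l_Mo                     # delta(x) = l_{M_o'} - l_{M_o}
p_ensemble = softmax(l_M + alpha * delta)     # additive-logit form

# Product-of-experts form: P_e ∝ P_M * (P_{M_o'} / P_{M_o})^alpha
poe = softmax(l_M) * (softmax(l_Mo_prime) / softmax(l_Mo)) ** alpha
poe = poe / poe.sum()
```

The differing softmax normalizers cancel when `poe` is renormalized, so the two vectors coincide; this is why the ensemble can be described interchangeably in either form.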

3. Algorithmic Procedures and Implementation

ULD (Ji et al., 2024) proceeds as follows:

  1. Initialize Assistant Model: Build an assistant by reusing the first $K$ layers and output head of the target LLM. Freeze all parameters except LoRA adapters.
  2. Data Augmentation: Augment forget examples via paraphrase and retain examples via perturbation, yielding $\mathcal{D}_f'$ and $\mathcal{D}_r'$.
  3. Train Assistant: Alternate batches from $\mathcal{D}_f'$ (minimize cross-entropy) and $\mathcal{D}_r'$ (maximize entropy), updating only the adapter parameters.
  4. Unlearning via Subtraction: At inference, compute $l_{\mathrm{unlearn}} = l_{\mathrm{target}} - \alpha l_{\mathrm{aux}}$ and apply $\mathrm{softmax}$ to generate output.
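The assistant's training signal in step 3 combines the two terms of $L_f(\phi) - \beta L_r(\phi)$. A forward-only NumPy sketch of the objective value (no autograd; `assistant_objective` and the toy batches are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def assistant_objective(logits_f, targets_f, logits_r, beta=1.0):
    """Value of L_f(phi) - beta * L_r(phi): cross-entropy on forget tokens
    minus beta times the mean output entropy on retain inputs, which the
    minimisation drives toward uniform predictions on the retain set."""
    p_f = softmax(logits_f)
    ce = -np.mean(np.log(p_f[np.arange(len(targets_f)), targets_f]))
    p_r = softmax(logits_r)
    entropy = -np.mean(np.sum(p_r * np.log(p_r + 1e-12), axis=-1))
    return ce - beta * entropy

# Toy batch: two forget positions (gold tokens 1 and 0), two retain positions.
lf = np.array([[0.1, 2.0, 0.0], [1.5, 0.0, 0.3]])
lr_uniform = np.zeros((2, 3))                       # maximal-entropy retain output
lr_peaked  = np.array([[3.0, 0.0, 0.0]] * 2)        # confident retain output
targets = np.array([1, 0])

loss_uniform = assistant_objective(lf, targets, lr_uniform, beta=0.5)
loss_peaked  = assistant_objective(lf, targets, lr_peaked,  beta=0.5)
```

An assistant that is uniform on retain inputs attains a lower objective than one that is confident there, which is exactly the behaviour the subtraction step relies on.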

For UCD (Suriyakumar et al., 12 Jun 2025), decoding proceeds at each step as follows:

  • Compute logits for the base model ($\ell_0$), forget model ($\ell_f$), and retain model ($\ell_r$).
  • Compute the contrastive difference $\Delta\ell = \ell_f - \ell_r$.
  • Subtract $\alpha \Delta\ell$ from $\ell_0$ to obtain $\ell_u$.
  • Apply temperature and sample from $\mathrm{softmax}(\ell_u)$.
$\delta$-Unlearning (Huang et al., 2024) proceeds as follows:

  1. Precompute Logits: Query the black-box LLM and both offset models on all unique $(q, y_{<t})$ pairs needed for training.
  2. Fine-Tune Offset Model: Update $M_o'$ using GRADASCENT, GRADDIF, or KL-MIN objectives, always computing the loss on the ensemble's output, not on $M_o'$ alone.
  3. Inference: For any query, combine black-box and offset-model logits as $l_e(x)$ and sample from $\mathrm{softmax}(l_e)$.
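The key point in step 2 is that the loss is evaluated on the ensemble distribution while the black-box logits stay fixed. A forward-only NumPy sketch of a gradient-ascent-style forget objective (the helper and toy logits are hypothetical):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ensemble_forget_loss(l_blackbox, l_off_base, l_off_tuned, target, alpha=1.0):
    """Negative log-likelihood of the forget token under the *ensemble*
    distribution. A GRADASCENT-style update would raise this value by
    changing only l_off_tuned; l_blackbox is a fixed input (logits access only)."""
    l_e = l_blackbox + alpha * (l_off_tuned - l_off_base)
    return -np.log(softmax(l_e)[target])

l_bb = np.array([2.0, 1.0, 0.0])    # black-box logits (fixed)
l_ob = np.array([0.5, 0.5, 0.5])    # offset model before tuning
l_ot = np.array([-1.5, 0.5, 0.5])   # offset model after pushing the forget token down

before = ensemble_forget_loss(l_bb, l_ob, l_ob, target=0)  # delta = 0 initially
after  = ensemble_forget_loss(l_bb, l_ob, l_ot, target=0)
# tuning the offset raises the forget token's NLL: after > before
```

Because identical initialization makes $\delta(x) = 0$ before tuning, the ensemble starts exactly at the black-box model's distribution and only departs from it where the offset pair disagrees.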

4. Empirical Validation and Comparative Results

4.1 Forget/Utility Trade-off

  • ULD (Ji et al., 2024): On the TOFU benchmark (1% forget task), achieves forgetting quality (F.Q.) $\approx 0.99$ (vs. $0.40$ baseline) with a $0\%$ drop in retained-task utility (baselines lose up to $17\%$ utility). On Harry Potter copyright text, BLEU/R-L metrics on forgotten text drop to near zero while non-forget perplexity holds at $\approx 9.95$ (baselines: $15$–$50$).
  • UCD (Suriyakumar et al., 12 Jun 2025): On Llama2-13B (TOFU 5%), full indistinguishability from retraining is achieved, with utility improved from $45\%$ (retrain) to $52\%$. On Llama2-70B, UCD gives utility $\approx 62\%$ (vs. $45\%$ for retraining) while holding forget quality at baseline.
  • $\delta$-Unlearning (Huang et al., 2024): On TOFU, matches retraining ($\mathrm{RL}_{\mathrm{Forget}} \approx 39$) and improves the paraphrased truth ratio (TR from $53 \rightarrow 58$) compared to direct fine-tuning. World-fact accuracy is preserved (RL/P/TR superior to direct FT in $11$ of $12$ cases).

4.2 Efficiency and Scalability

  • Training Cost: ULD reduces per-epoch training time by over $3\times$ (using only $\sim$20M LoRA parameters).
  • Inference Overhead: All logit-subtraction methods require at least one additional forward pass per input (two for ULD/UCD, two or three for δ\delta-Unlearning), but the computation is parallelizable. The assistant/offset models are typically much smaller than the main LLM.
  • Size Scalability: UCD is demonstrated at scale on Llama2-70B, and δ\delta-Unlearning is compatible with completely black-box APIs.

4.3 Summary of Method Differences

| Method | Extra Models | Training Scope | Black-box Support | Inference-Only? |
| --- | --- | --- | --- | --- |
| ULD (Ji et al., 2024) | 1 (assistant) | LoRA adapters, parameter-efficient | No | No (but efficient deployment) |
| UCD (Suriyakumar et al., 12 Jun 2025) | 2 (forget/retain) | Auxiliary models | Yes | Yes |
| $\delta$-Unlearning (Huang et al., 2024) | 2 (offset) | Offset models, LoRA | Yes | Yes |

5. Limitations and Extensions

  • Inference Latency: Logit-subtraction methods introduce additional forward passes, increasing decoding latency, but the cost is minor for parameter-efficient offsets or when amortized over batched queries.
  • Auxiliary Model Quality: Success depends on the auxiliary models faithfully encoding the forget/retain knowledge. Suboptimal fine-tuning or model misalignment can degrade both forgetting and utility.
  • Retain Set Design: All methods are vulnerable to mis-specification or insufficient coverage of $\mathcal{D}_r$, especially if the non-forget generalization set is large or diverse.
  • Hyperparameter Sensitivity: The suppression weight $\alpha$ and entropy targets must be tuned; aggressive suppression degrades utility, while weak suppression fails to erase target information (Suriyakumar et al., 12 Jun 2025).
  • Generality: Logit subtraction has been proposed as a general expert-composition paradigm, potentially applicable to sentiment steering, bias removal, and post-hoc factuality correction (Ji et al., 2024).
  • Adaptation to Black-box and Large-scale APIs: $\delta$-Unlearning (Huang et al., 2024) demonstrates that logit-based unlearning is viable with logits access alone, without gradient or parameter access, increasing its practicality for commercial LLM deployments.

6. Historical Context: Linear Filtration

The concept of manipulating model logits for targeted unlearning originates in earlier work on linear filtration for logit-based classifiers (Baumhauer et al., 2020). There, a linear transformation is applied to the final logits to sanitize models post hoc, especially for class-wise deletion in image or tabular classifiers. The method can in principle be transferred to LLMs by treating each vocabulary token as a removable class and constructing the appropriate filtration matrix, but direct adaptation is computationally intensive for modern vocabulary sizes.
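To make the classifier-era idea concrete, here is a minimal sketch of a "naive" filtration: a linear map over class probabilities that zeroes out a deleted class, after which renormalization redistributes its mass. This is a simplified illustration, not the exact filtration operators of Baumhauer et al. (2020):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def naive_filtration_matrix(n_classes, deleted):
    """Linear map A over class probabilities that zeroes the deleted class.
    A simplified sketch of post-hoc sanitization; the published filtration
    operators are more refined."""
    A = np.eye(n_classes)
    A[deleted, deleted] = 0.0
    return A

p = softmax(np.array([1.0, 2.0, 0.5]))   # toy 3-class classifier output
A = naive_filtration_matrix(3, deleted=1)
q = A @ p
q = q / q.sum()                          # class 1 now has exactly zero mass
```

The LLM-scale obstacle mentioned above is visible here: with a vocabulary of ~100k tokens, the analogous matrix is 100k × 100k, which motivates the model-based subtraction schemes instead.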

Logit subtraction generalizes these ideas by leveraging parameter-efficient adaptation (LoRA), auxiliary expert models, and contrastive decoding composition. Compared to in-context prompt unlearning or direct model fine-tuning, logit-subtraction achieves a superior balance of efficacy, generality, and privacy compliance, especially where access to full training sets or gradients is unavailable.

7. Practical Guidelines and Extensions

  • Model Selection: Auxiliary models should be much smaller than the primary LLM for practical fine-tuning. Empirical results indicate 7B auxiliaries suffice for 13B and 70B base models (Suriyakumar et al., 12 Jun 2025, Huang et al., 2024).
  • Loss/Objective Choice: Logit subtraction supports integration with any unlearning objective, including gradient ascent, gradient difference, and KL minimization (Huang et al., 2024).
  • Parameter Tuning: For most benchmarks, $\alpha \approx 0.5$–$1.0$ yields the best trade-offs; too small under-forgets, too large degrades utility (Suriyakumar et al., 12 Jun 2025).
  • Data Augmentation: Effective paraphrase and perturbation strategies for $\mathcal{D}_f'$ and $\mathcal{D}_r'$ are critical and may benefit from automation.
  • Inference Policy: Deterministic and stochastic decoding strategies are both compatible; key forget/utility metrics remain invariant under deterministic strategies (Suriyakumar et al., 12 Jun 2025).
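The $\alpha$ trade-off in the tuning guideline above can be illustrated numerically: as $\alpha$ grows, the probability of the sensitive token falls monotonically, but so (eventually) does fidelity to the base distribution. The toy logits below are hypothetical:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy per-token logits: token 0 plays the role of the sensitive continuation.
l0   = np.array([3.0, 2.5, 1.0])    # base model
diff = np.array([4.0, -0.5, 0.0])   # forget-minus-retain contrast

# Sweep the suppression weight and track the sensitive token's probability.
alphas = (0.0, 0.5, 1.0, 2.0)
probs = [softmax(l0 - a * diff)[0] for a in alphas]
for a, p in zip(alphas, probs):
    print(f"alpha={a}: P(sensitive)={p:.3f}")
```

In practice the sweep would be scored against both a forget metric and a utility metric, picking the smallest $\alpha$ that reaches the forgetting target.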

In sum, logit-subtraction has established itself as a powerful, theoretically principled, and empirically validated framework for efficient, privacy-compliant, and utility-preserving LLM unlearning, with direct extensions to broader model editing and steering applications (Ji et al., 2024, Suriyakumar et al., 12 Jun 2025, Huang et al., 2024, Baumhauer et al., 2020).
