Logit-Subtraction: A Method for LLM Unlearning
- The paper introduces logit subtraction, a novel paradigm that suppresses sensitive knowledge via algebraic manipulation of pre-softmax logits.
- The paper leverages auxiliary models to target forgetting without extensive fine-tuning, achieving high forgetting quality (F.Q. ≈ 0.99) while maintaining task utility.
- The paper demonstrates that contrastive and offset methods enable efficient, modular unlearning, though challenges remain in inference latency and hyperparameter tuning.
Logit-subtraction is a paradigm for LLM unlearning in which sensitive knowledge is suppressed through algebraic manipulation of pre-softmax logits, rather than further fine-tuning billions of parameters. This approach enables targeted forgetting while preserving general utility and minimizing catastrophic forgetting. Unlike traditional fine-tuning–based unlearning, logit-subtraction enables efficient, modular, and often inference-time remediation of privacy, copyright, or safety violations. The core idea is to leverage one or more auxiliary models (“assistants” or “offset” models) whose logits encode knowledge of the forget and/or retain sets, and to remove or neutralize the undesired contributions by subtractive or contrastive combinations at the logit level (Ji et al., 2024, Suriyakumar et al., 12 Jun 2025, Huang et al., 2024).
1. Theoretical Foundation and Motivation
Conventional LLM unlearning is formalized by specifying a “forget” set $D_f$ and (optionally) a “retain” set $D_r$, then aiming for an updated model $\theta$ that fails to recall information in $D_f$ but retains its general capabilities on $D_r$ and downstream tasks. Most standard techniques optimize an objective of the form

$$\min_\theta \; \mathcal{L}(\theta) = -\mathcal{L}_f(\theta) + \lambda\,\mathcal{L}_r(\theta),$$

with $\mathcal{L}_f$, $\mathcal{L}_r$ being cross-entropy losses on $D_f$ and $D_r$. This framework is vulnerable to two canonical pathologies:
- Unbounded forgetting: Maximizing $\mathcal{L}_f$ can cause unbounded logit growth, producing degenerate, repetitive, or pathological outputs.
- Catastrophic generalization loss: The retain set $D_r$ is typically much smaller than the original pretraining set, so minimizing $\mathcal{L}_r$ fails to fully preserve general-purpose capabilities, resulting in severe utility degradation.
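To make the sign structure of this objective concrete, here is a minimal numpy sketch (function names and the toy logits are illustrative, not from the cited papers). Note how the negated forget term means the objective improves without bound as forget-set cross-entropy grows, which is exactly the unbounded-forgetting pathology:

```python
import numpy as np

def cross_entropy(logits, targets):
    # Mean next-token cross-entropy over a batch of logit rows.
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def unlearning_objective(logits_f, targets_f, logits_r, targets_r, lam=1.0):
    # Composite objective: ascend cross-entropy on the forget set (the
    # negated term) while descending it on the retain set.
    L_f = cross_entropy(logits_f, targets_f)
    L_r = cross_entropy(logits_r, targets_r)
    return -L_f + lam * L_r
```

In practice both terms are computed with automatic differentiation over the full model; the sketch only illustrates how the two losses are combined.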
Logit-subtraction methods decouple forgetting from global model editing by exploiting the additivity of logits and formulating unlearning as a form of expert composition. By combining the original model and appropriately trained or fine-tuned assistant models at the logit level, these approaches can suppress targeted knowledge without global retraining (Ji et al., 2024, Suriyakumar et al., 12 Jun 2025, Huang et al., 2024).
2. Logit-Subtraction Mechanisms and Formulations
2.1 Unlearning from Logit Difference (ULD)
The ULD framework (Ji et al., 2024) introduces an assistant LLM, $\theta_a$, trained with a reversed unlearning objective: it remembers $D_f$ and forgets $D_r$. The core logit-subtraction formula is

$$z^{\text{unlearn}}_t = z^{\text{target}}_t - \alpha\, z^{\text{assist}}_t,$$

where $z^{\text{target}}_t$ and $z^{\text{assist}}_t$ are the pre-softmax logits of the original (target) and assistant models, respectively, and $\alpha$ controls suppression strength. The assistant is optimized by

$$\mathcal{L}_{\text{assist}} = \mathcal{L}_{\text{CE}}(\theta_a; D_f) + \mathcal{L}_{\text{ent}}(\theta_a; D_r),$$

where $\mathcal{L}_{\text{CE}}$ is a standard cross-entropy on the forget set, and $\mathcal{L}_{\text{ent}}$ maximizes entropy (drives $\theta_a$ toward uniform output) on the retain set.
After training, inference proceeds by subtracting the assistant's logits from the original model's and applying softmax, which depresses token probabilities for forget-set completions while leaving retain-set completions essentially unchanged, since a near-uniform assistant shifts those logits by roughly a constant.
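The subtraction step itself is a one-liner. A small numpy sketch (names illustrative) shows both effects, assuming an assistant that is confident on a forget-set token and near-uniform elsewhere:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def uld_probs(z_target, z_assist, alpha=1.0):
    # ULD inference: subtract the assistant's logits (which encode the
    # forget set) from the target model's logits, then renormalize.
    return softmax(z_target - alpha * z_assist)
```

Because softmax is invariant to constant shifts, a perfectly uniform assistant leaves the target distribution untouched, while a peaked assistant suppresses exactly the tokens it is confident about.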
2.2 Contrastive Decoding for Inference-Time Unlearning (UCD)
UCD (Suriyakumar et al., 12 Jun 2025) generalizes logit subtraction to inference time using two auxiliary models, $M_f$ (forget model) and $M_r$ (retain model), fine-tuned on $D_f$ and $D_r$, respectively. At each timestep $t$, the unlearning-adjusted logits are

$$\tilde{z}_t = z_t - \alpha\,\big(z^{f}_t - z^{r}_t\big),$$

where $z_t$ is the original model's logit vector, $z^f_t$ and $z^r_t$ are the forget and retain models' logits, and $\alpha$ is a tunable suppression hyperparameter. Greedy, top-$k$, or temperature sampling can then be applied to $\tilde{z}_t$. This approach requires no model updates and can be layered over any existing model checkpoint.
A variant termed “contrastive suppression” clips negative differences, i.e., it only reduces the probability of sensitive tokens:

$$\tilde{z}_t = z_t - \alpha\,\max\big(z^{f}_t - z^{r}_t,\ 0\big).$$
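Both the plain subtraction and the clipped variant can be sketched in a few lines of numpy (names illustrative):

```python
import numpy as np

def ucd_logits(z_base, z_forget, z_retain, alpha=1.0, clip=False):
    # UCD: subtract the forget/retain logit contrast from the base model.
    delta = z_forget - z_retain
    if clip:
        # Contrastive suppression: only subtract positive differences, so
        # tokens the retain model prefers are left untouched rather than
        # boosted.
        delta = np.maximum(delta, 0.0)
    return z_base - alpha * delta
```

The clipped variant is strictly one-sided: it can only lower token logits relative to the base model, never raise them.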
2.3 Offset-Unlearning for Black-Box LLMs
δ-Unlearning (Huang et al., 2024) addresses black-box scenarios by learning a logit offset from a pair of smaller white-box models $\theta_s$ and $\theta_{s'}$, both initialized identically. $\theta_{s'}$ is fine-tuned for unlearning, yielding the offset

$$\delta_t = z^{s'}_t - z^{s}_t.$$

The ensemble logit for the black-box LLM is then

$$z^{\text{ens}}_t = z^{\text{bb}}_t + \alpha\,\delta_t$$

at each generation step (with typically $\alpha = 1$). This construction allows modular, privacy-preserving, plug-and-play unlearning, with ensemble probabilities interpreted as a product of experts:

$$p_{\text{ens}}(y_t \mid x) \;\propto\; p_{\text{bb}}(y_t \mid x)\,\left(\frac{p_{s'}(y_t \mid x)}{p_{s}(y_t \mid x)}\right)^{\alpha}.$$
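The ensemble construction and its product-of-experts reading can be checked numerically with a short numpy sketch (names illustrative; assumes logit access to all three models):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def delta_ensemble(z_blackbox, z_offset_ft, z_offset_base, alpha=1.0):
    # Add the learned logit offset (fine-tuned minus base small model)
    # to the black-box model's logits at each generation step.
    return z_blackbox + alpha * (z_offset_ft - z_offset_base)
```

Since logit addition becomes probability multiplication under softmax, the softmax of the ensemble logits equals the renormalized product of the black-box distribution with the ratio of the two offset-model distributions.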
3. Algorithmic Procedures and Implementation
3.1 ULD Algorithm Steps (Ji et al., 2024)
- Initialize Assistant Model: Build an assistant by reusing the first $k$ transformer layers and the output head of the target LLM. Freeze all weights except the LoRA adapters.
- Data Augmentation: Augment forget examples via paraphrase and retain examples via perturbation.
- Train Assistant: Alternate batches from $D_f$ (minimize CE) and $D_r$ (maximize entropy), updating only the adapter parameters.
- Unlearning via Subtraction: At inference, compute $z^{\text{unlearn}}_t = z^{\text{target}}_t - \alpha\, z^{\text{assist}}_t$ and sample from $\mathrm{softmax}(z^{\text{unlearn}}_t)$ to generate output.
3.2 UCD Procedure (Suriyakumar et al., 12 Jun 2025)
At decoding time, for each step:
- Compute logits $z_t$, $z^f_t$, and $z^r_t$ for the base model, forget model $M_f$, and retain model $M_r$.
- Compute the contrastive difference $\Delta_t = z^f_t - z^r_t$.
- Subtract $\alpha\,\Delta_t$ from $z_t$ to obtain $\tilde{z}_t = z_t - \alpha\,\Delta_t$.
- Apply temperature scaling and sample the next token from $\mathrm{softmax}(\tilde{z}_t / \tau)$.
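The per-step procedure above fits in one function; a minimal numpy sketch (names illustrative), with greedy decoding when no sampler is supplied:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ucd_decode_step(z_base, z_forget, z_retain,
                    alpha=1.0, temperature=1.0, rng=None):
    # One UCD decoding step: contrast, subtract, temperature, sample.
    z = z_base - alpha * (z_forget - z_retain)
    p = softmax(z / temperature)
    if rng is None:                      # greedy decoding
        return int(np.argmax(p))
    return int(rng.choice(len(p), p=p))  # stochastic sampling
```

Note that the subtraction can change the greedy pick: a token the base model prefers is overridden when the forget model is much more confident about it than the retain model.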
3.3 δ-Unlearning Workflow (Huang et al., 2024)
- Precompute Logits: Query the black-box LLM and both offset models on all unique inputs needed for training.
- Fine-Tune Offset Model: Update $\theta_{s'}$ using GRADASCENT, GRADDIF, or KL-MIN objectives, with the loss always applied to the ensemble's output, not to $\theta_{s'}$ alone.
- Inference: For any query, combine black-box and offset-model logits as $z^{\text{ens}}_t = z^{\text{bb}}_t + (z^{s'}_t - z^{s}_t)$ and sample from $\mathrm{softmax}(z^{\text{ens}}_t)$.
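The key design point (the loss acts on the ensemble, not on the offset model alone) can be made concrete with a GRADASCENT-style forget loss in numpy (names illustrative; the black-box logits are the precomputed constants from the first step):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ensemble_forget_loss(z_bb, z_offset_ft, z_offset_base, targets):
    # GRADASCENT-style objective on a forget batch: negate the *ensemble's*
    # cross-entropy.  In practice only z_offset_ft carries gradients; the
    # black-box logits z_bb are cached constants.
    z_ens = z_bb + (z_offset_ft - z_offset_base)
    lp = log_softmax(z_ens)
    ce = -lp[np.arange(len(targets)), targets].mean()
    return -ce  # minimizing this ascends cross-entropy on the forget set
```

Because the loss is computed on the combined logits, the offset model learns exactly the correction the black-box model needs, rather than unlearning in isolation.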
4. Empirical Validation and Comparative Results
4.1 Forget/Utility Trade-off
- ULD (Ji et al., 2024): On the TOFU benchmark (1% forget task), achieves forgetting quality F.Q. $\approx 0.99$ (vs. a $0.40$ baseline) with minimal drop in retained-task utility, where baselines suffer far larger utility losses. On Harry Potter copyright text, BLEU/ROUGE-L on the forgotten text drops to near zero while non-forget perplexity stays near its pre-unlearning level (baselines degrade to $15$–$50$).
- UCD (Suriyakumar et al., 12 Jun 2025): On Llama2-13B (TOFU 5%), achieves full indistinguishability from retraining while improving utility over the retrained model. On Llama2-70B, UCD likewise attains utility comparable to retraining while holding forget quality at baseline.
- δ-Unlearning (Huang et al., 2024): On TOFU, matches retraining in forget quality and improves the paraphrased truth ratio (TR) compared to direct fine-tuning. World-fact accuracy is preserved (RL/P/TR superior to direct FT in $11$ of $12$ cases).
4.2 Efficiency and Scalability
- Training Cost: ULD substantially reduces per-epoch training time relative to full-model fine-tuning, since it updates only about 20M LoRA parameters.
- Inference Overhead: All logit-subtraction methods require at least one additional forward pass per input (two for ULD/UCD, two or three for δ-Unlearning), but the computation is parallelizable. The assistant/offset models are typically much smaller than the main LLM.
- Size Scalability: UCD is demonstrated at scale on Llama2-70B, and δ-Unlearning is compatible with completely black-box APIs.
4.3 Summary of Method Differences
| Method | # of Extra Models | Training Scope | Black-box Support | Inference Only? |
|---|---|---|---|---|
| ULD (Ji et al., 2024) | 1 (assistant) | LoRA adapters, param-efficient | No | No (but efficient deployment) |
| UCD (Suriyakumar et al., 12 Jun 2025) | 2 (forget/retain) | Auxiliary models | Yes | Yes |
| δ-Unlearning (Huang et al., 2024) | 2 (offset) | Offset models, LoRA | Yes | Yes |
5. Limitations and Extensions
- Inference Latency: Logit-subtraction methods introduce additional forward passes, increasing decoding latency, but the cost is minor for parameter-efficient offsets or when amortized over batched queries.
- Auxiliary Model Quality: Success depends on the auxiliary models faithfully encoding the forget/retain knowledge. Suboptimal fine-tuning or model misalignment can degrade both forgetting and utility.
- Retain Set Design: All methods are vulnerable to mis-specification or insufficient coverage of $D_r$, especially if the non-forget generalization set is large or diverse.
- Hyperparameter Sensitivity: The suppression weight $\alpha$ and entropy targets must be tuned; aggressive suppression can degrade utility, while weak suppression fails to erase target information (Suriyakumar et al., 12 Jun 2025).
- Generality: Logit-subtraction has been proposed as a general expert-composition paradigm, potentially applicable for sentiment steering, bias removal, and post-hoc factuality correction (Ji et al., 2024).
- Adaptation to Black-box and Large-scale APIs: δ-Unlearning (Huang et al., 2024) demonstrates that logit-based unlearning is viable with access to logits alone, without gradient or parameter access, increasing practicability for commercial LLM deployments.
6. Connections to Preceding and Related Approaches
The concept of manipulating model logits for targeted unlearning originates in earlier works on linear filtration for logit-based classifiers (Baumhauer et al., 2020). There, a linear transformation is applied to the final logits to sanitize models post-hoc, especially for class-wise deletion in image or tabular classifiers. The method can be transferred in principle to LLMs by treating each vocabulary token as a removable class and constructing the appropriate filtration matrix, but direct adaptation is computationally intensive for modern vocabulary sizes.
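For intuition, the simplest ("naive") member of the linear-filtration family can be written as zeroing the removed classes' probability mass and renormalizing — a sketch only; Baumhauer et al. also derive smoother redistribution transformations:

```python
import numpy as np

def naive_filtration(probs, removed):
    # Zero out the removed classes and renormalize the remaining mass.
    # This is the naive variant of linear filtration, shown for intuition;
    # it corresponds to a linear map on the probability vector followed by
    # normalization.
    filtered = np.asarray(probs, dtype=float).copy()
    filtered[..., list(removed)] = 0.0
    return filtered / filtered.sum(axis=-1, keepdims=True)
```

Applied naively to an LLM, `removed` would have to enumerate vocabulary tokens, which is why direct adaptation is costly at modern vocabulary sizes.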
Logit subtraction generalizes these ideas by leveraging parameter-efficient adaptation (LoRA), auxiliary expert models, and contrastive decoding composition. Compared to in-context prompt unlearning or direct model fine-tuning, logit-subtraction achieves a superior balance of efficacy, generality, and privacy compliance, especially where access to full training sets or gradients is unavailable.
7. Practical Guidelines and Extensions
- Model Selection: Auxiliary models should be much smaller than the primary LLM for practical fine-tuning. Empirical results indicate 7B auxiliaries suffice for 13B and 70B base models (Suriyakumar et al., 12 Jun 2025, Huang et al., 2024).
- Loss/Objective Choice: Logit-subtraction supports integration with any unlearning objective including gradient ascent, gradient difference, and KL minimization (Huang et al., 2024).
- Parameter Tuning: The suppression weight $\alpha$ must be tuned per benchmark; too small a value under-forgets, while too large a value degrades utility (Suriyakumar et al., 12 Jun 2025).
- Data Augmentation: Effective paraphrase and perturbation strategies for $D_f$ and $D_r$ are critical and may benefit from automation.
- Inference Policy: Deterministic and stochastic decoding strategies are compatible; key forget/utility metrics remain invariant under deterministic strategies (Suriyakumar et al., 12 Jun 2025).
In sum, logit-subtraction has established itself as a powerful, theoretically principled, and empirically validated framework for efficient, privacy-compliant, and utility-preserving LLM unlearning, with direct extensions to broader model editing and steering applications (Ji et al., 2024, Suriyakumar et al., 12 Jun 2025, Huang et al., 2024, Baumhauer et al., 2020).