Inference-Time Unlearning Techniques
- Inference-time unlearning is the process of eliminating specific training data’s influence during inference, enabling real-time privacy compliance.
- Approaches include test-time deletion, decoding-time masking, and auxiliary model adjustments to suppress forgotten content without altering core parameters.
- Robust metrics such as UnleScore, IAM, and ReMI validate unlearning completeness while balancing latency, model performance, and privacy safeguards.
Inference-time unlearning is the process of removing or obviating the influence of targeted training data from a deployed model at query time, without requiring full or partial retraining. This paradigm spans a spectrum of approaches: some architectures permit explicit “test-time deletion” of memory elements; others employ decoding-time masking or penalization to prevent outputting forbidden content. Recent developments include provably efficient mechanisms for supervised learning, generative models, and LLMs, as well as rigorous frameworks for measuring residual memorization risk post unlearning.
1. Principles and Motivation
Inference-time unlearning is driven by the need to comply with privacy regulations such as the Right to Be Forgotten (RTBF), minimize retraining overhead, and ensure real-time service continuity, particularly in machine learning as a service (MLaaS) settings. The core goal is to ensure that, after an unlearning request, the deployed model’s responses are indistinguishable from those of a hypothetical model retrained without the “forget set.” This prohibits adversarial inference of deleted data and prevents privacy leakage, even when offline retraining or log-based edits are impractical.
Design principles include:
- Black-box operability: Most frameworks rely solely on inference-time access to the trained model (sometimes augmented with auxiliary models), rather than modifying internal weights or architectures.
- Strict or probabilistic compliance: Some methods guarantee, via inference consistency certificates, that no query reveals information about deleted data unless retraining intervenes; others offer high-probability guarantees by construction.
- Latency and overhead awareness: Algorithms are tailored to minimize additional latency and computational cost relative to naïve retraining.
2. Shard-Aggregate and Consistency-Certification Mechanisms
A canonical approach in supervised learning is ERASER (Hu et al., 2023), which builds upon shard-aggregate architectures (“SISA”): the dataset is partitioned into K disjoint shards, each with a constituent model. At inference, ensemble voting yields the final prediction. Upon receiving unlearning requests, the ERASER scheduler:
- Defers the retraining of affected shards by aggregating pending requests and serving as many “safe” queries as possible on the old ensemble.
- Employs a black-box inference consistency test: given a query, it determines whether the ensemble’s majority vote could change under any (worst-case) removal of the unlearned elements. If a query’s response is “certified” (i.e., invariant to pending unlearning), it is served immediately; otherwise, retraining is triggered.
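The consistency test described above can be illustrated with a minimal sketch. It assumes a deliberately conservative rule that is not taken from the ERASER paper: every shard with a pending unlearning request might change its vote arbitrarily after retraining. The function name `is_certified` and its input format are hypothetical.

```python
from collections import Counter

def is_certified(votes, affected):
    """Return True if the ensemble's majority vote cannot change,
    no matter how the affected shards vote after retraining.

    votes:    list of class labels, one per shard model (assumed format).
    affected: set of shard indices that hold pending unlearning requests.
    """
    full = Counter(votes)
    winner = max(full, key=full.get)  # current majority class

    # Votes from shards that are untouched by pending requests.
    safe = Counter(v for i, v in enumerate(votes) if i not in affected)
    n_affected = sum(1 for i in range(len(votes)) if i in affected)

    # Worst case: every affected shard switches to the strongest rival.
    best_rival = max((n for c, n in safe.items() if c != winner), default=0)
    return safe[winner] > best_rival + n_affected
```

With five shards voting `[0, 0, 0, 0, 1]` and only shard 4 affected, the winner keeps four safe votes against at most one rival vote, so the query is certified. With votes `[0, 0, 0, 1, 1]` and shard 0 affected, the margin can be overturned and the query is not certified.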
Three scheduling policies are employed:
- Immediate-unlearning: retrain shards as soon as an unlearning request arrives.
- Uncertification-triggered: batch retraining only when a query fails consistency.
- Threshold-triggered: permit a tunable fraction of uncertified inferences for lower overhead.
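As a rough illustration of how the three policies differ, the sketch below encodes each as a retraining decision made per incoming query. The names `Policy` and `should_retrain`, the bookkeeping counters, and the interpretation of the "tunable fraction" as a cap on the share of uncertified responses served are all assumptions, not details from the cited work.

```python
from enum import Enum

class Policy(Enum):
    IMMEDIATE = "immediate-unlearning"
    UNCERTIFIED = "uncertification-triggered"
    THRESHOLD = "threshold-triggered"

def should_retrain(policy, pending, certified,
                   uncertified_served=0, total_served=0, max_fraction=0.0):
    """Decide whether to retrain affected shards before answering a query.

    pending:            number of unlearning requests not yet applied.
    certified:          result of the consistency test for this query.
    uncertified_served: uncertified answers already served (assumed counter).
    total_served:       all answers already served (assumed counter).
    max_fraction:       tolerated share of uncertified answers (assumed knob).
    """
    if pending == 0:
        return False              # nothing to unlearn
    if policy is Policy.IMMEDIATE:
        return True               # never defer
    if certified:
        return False              # answer is invariant to pending requests
    if policy is Policy.UNCERTIFIED:
        return True               # first uncertified query forces retraining
    # THRESHOLD: retrain only if serving this query would exceed the cap.
    new_fraction = (uncertified_served + 1) / (total_served + 1)
    return new_fraction > max_fraction
```

Setting `max_fraction=0.0` makes the threshold policy behave like the uncertification-triggered one, which is one way to see the latter as a special case under these assumptions.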
Empirical results on image/classification benchmarks demonstrate up to 99% reduction in average inference latency and 31% fewer retrains compared to baseline retrain-on-every-request systems (Hu et al., 2023).
3. Inference-Time Unlearning for Generative Models and LLMs
Contrastive Decoding and Output Penalty Schemes
In LLMs, modifying parameters for unlearning is often prohibitively expensive. Inference-time approaches such as Unlearning via Contrastive Decoding (UCD) (Suriyakumar et al., 12 Jun 2025) employ two auxiliary models: one trained on the retain set and one trained on the forget set. At each decoding step, the original model’s logits are contrastively adjusted using the two auxiliary models’ outputs to penalize the likelihood of forget-associated tokens. This suppresses tokens linked to forgotten content while preserving high utility on the retain set. Experiments on the TOFU and MUSE benchmarks show that UCD achieves retrain-level forgetting and nearly original performance on retain prompts, without model updates (Suriyakumar et al., 12 Jun 2025).
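The exact adjustment rule is not reproduced in this text, so the following is only a sketch of one common contrastive-decoding form: the base logits are shifted by a scaled difference between the retain-model and forget-model logits. The function names and the strength parameter `alpha` are assumptions for illustration.

```python
import math

def contrastive_adjust(base, retain_aux, forget_aux, alpha=1.0):
    """Shift base logits toward the retain model and away from the forget model.

    All three inputs are per-token logit lists over the same vocabulary.
    alpha is an assumed strength knob, not a parameter named in the source.
    """
    return [b + alpha * (r - f) for b, r, f in zip(base, retain_aux, forget_aux)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two-token vocabulary: the forget model strongly prefers token 1.
base = [2.0, 2.0]
adjusted = contrastive_adjust(base, retain_aux=[0.0, 0.0], forget_aux=[0.0, 3.0])
probs = softmax(adjusted)  # token 1 is now much less likely than token 0
```

In this toy case the base model is indifferent between the two tokens, and the adjustment alone moves probability mass away from the token that the forget model favors.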
Further, GUARD (Deng et al., 19 May 2025) introduces dynamic generation-time filtering in LLMs. It augments standard beam search by:
- Classifying prompts for forget relevance,
- Extracting a set of forbidden tokens or spans,
- Imposing adaptive penalties (hard blocking and SBERT-based semantic similarity filtering) on candidate outputs.
GUARD attains strong forget quality (highest p-values in entity unlearning, lowest verbatim recall in copyright tasks) with negligible impact on model fluency or downstream accuracy.
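The penalty step can be sketched as a rescoring pass over beam candidates. This is not GUARD's implementation: the helper names are hypothetical, and a simple token-overlap (Jaccard) similarity stands in for the SBERT-based semantic similarity so that the example stays self-contained.

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for SBERT cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def rescore(candidates, forbidden, sim_threshold=0.5, penalty=5.0):
    """Filter and penalize beam candidates against forbidden spans.

    candidates: list of (text, log_score) pairs from beam search.
    forbidden:  list of forbidden spans extracted for the prompt.
    sim_threshold and penalty are assumed tuning knobs.
    """
    rescored = []
    for text, score in candidates:
        # Hard blocking: drop candidates containing a forbidden span verbatim.
        if any(span.lower() in text.lower() for span in forbidden):
            continue
        # Soft filtering: penalize candidates that are too similar to a span.
        sim = max((jaccard(text, span) for span in forbidden), default=0.0)
        if sim >= sim_threshold:
            score -= penalty * sim
        rescored.append((text, score))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

beams = [
    ("alice lives in paris", -1.0),        # verbatim match: blocked
    ("the person lives in france", -2.0),  # dissimilar: unchanged
    ("lives in paris alice", -1.5),        # reordered paraphrase: penalized
]
ranked = rescore(beams, forbidden=["alice lives in paris"])
```

Hard blocking alone would let the reordered candidate through, which is why a similarity-based penalty is layered on top in this sketch.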
Iterative Verifier-Guided Refinement
For generative models, Chowdhury et al. (3 Feb 2026) formulate inference-time unlearning as an iterative refinement loop: a frozen base model’s outputs are filtered by a black-box verifier (typically another LLM), which scores responses for forgetting quality. The system adaptively resamples until the response passes a conformally calibrated threshold, which guarantees, with a specified probability, that outputs comply with unlearning demands. This mechanism achieves a 93% error reduction on privacy benchmarks relative to prior methods, while holding retain-set fidelity near baseline.
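A minimal sketch of such a loop is given below. It uses the standard split-conformal quantile to set the threshold from verifier scores of known non-compliant calibration responses. The paper's own calibration procedure and guarantee are not reproduced here, and the function names, the `generate`/`verify` callables, and the fallback message are all hypothetical.

```python
import math

def conformal_threshold(calib_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.

    calib_scores: verifier scores for calibration responses that are known
                  to violate the unlearning request (assumed setup).
    Returns infinity when the calibration set is too small for the level.
    """
    n = len(calib_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calib_scores)[k - 1]

def refine(generate, verify, threshold, max_tries=5,
           fallback="I can't help with that."):
    """Resample from a frozen model until the verifier score clears the threshold."""
    for attempt in range(max_tries):
        response = generate(attempt)
        if verify(response) > threshold:
            return response
    return fallback  # no sample passed within the budget
```

The retry budget `max_tries` bounds the added latency. Repeated resampling also weakens a single-shot conformal guarantee, so a real deployment would need the calibration argument from the cited work rather than this simplified threshold.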
4. Explicit Test-Time Deletion in Semi-Parametric and Sparse Architectures
Recent work demonstrates that certain deep architectures can support zero-weight-update, explicit deletion of training examples or classes at inference:
- Semi-parametric models (SPMs) (Zheng et al., 24 Mar 2026): The forward pass fuses query representations with a set of per-training-sample embeddings using an attention mechanism. Removing any subset of entries from the memory bank at test time eliminates their influence, achieving unlearning instantly without retraining. On ImageNet, this reduces prediction gaps to within 11% of retraining and realizes 10–200,000× speed-ups.
- Discrete Key-Value Bottleneck (DKVB) (Shah et al., 2023): In models equipped with sparse, discrete codebooks, unlearning a class involves masking the key-value slots activated by forgotten-class samples. This masking at inference disconnects all representational pathways through which forget-class knowledge flows, empirically reducing forget-class accuracy to chance while incurring <1% degradation on retain classes.
Both approaches exploit the structural locality of representation and avoid parameter updates.
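The memory-bank idea can be made concrete with a toy attention readout. This is an illustrative sketch rather than the architecture from the cited work: the memory layout (an id mapped to a key vector and a one-hot label), the dot-product attention, and the function names are assumptions.

```python
import math

def predict(query, memory):
    """Attention-weighted average of stored labels.

    memory: dict mapping sample id -> (key_vector, one_hot_label).
    """
    if not memory:
        raise ValueError("memory bank is empty")
    ids = list(memory)
    scores = [sum(q * k for q, k in zip(query, memory[i][0])) for i in ids]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # stable softmax numerators
    z = sum(weights)
    dim = len(memory[ids[0]][1])
    return [sum(w * memory[i][1][d] for w, i in zip(weights, ids)) / z
            for d in range(dim)]

def forget(memory, forget_ids):
    """Test-time deletion: drop entries from the bank, no weight updates."""
    return {i: kv for i, kv in memory.items() if i not in forget_ids}

bank = {
    "a": ([1.0, 0.0], [1.0, 0.0]),
    "b": ([0.0, 1.0], [0.0, 1.0]),
    "c": ([1.0, 1.0], [0.0, 1.0]),
}
before = predict([1.0, 1.0], bank)
after = predict([1.0, 1.0], forget(bank, {"c"}))
```

In this toy setting the prediction after deletion is identical to the prediction from a bank that never contained sample "c", which is the property that makes deletion equivalent to retraining for the memory component. Any influence the deleted samples had on shared parametric weights is outside what this sketch models.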
5. Metrics and Verification of Unlearning Completeness
Measurement and auditing of unlearning completeness at inference are critical for both exact and approximate approaches.
- LUCM-Universal Score (“UnleScore”) (Wang et al., 2024): A black-box, sample-level metric that compares the unlearned model’s output confidence for a sample with that for genuine non-members, using Gaussian likelihood transforms and logit-change statistics. It correlates strongly (r=0.83) with true unlearning, runs 24× faster than membership inference attacks, and can monitor both under- and over-unlearning events in real time.
- IAM (Interpolated Approximate Measurement) (Wang et al., 6 Jun 2025): For each queried sample, responses of the original, unlearned, and shadow OUT models are interpolated to yield a “membership score” reflecting the degree to which the sample remains memorized. IAM attains state-of-the-art AUC in exact and approximate unlearning, flags sample-level risks, and supports LLMs with single shadow models.
- Membership-fingerprinting (ReMI) (Sula et al., 2024): The ReMI approach uses a differentiable loss to align on-task performance and minimize a membership inference–based privacy loss, with theoretical guarantees based on KL-divergence upper bounds. White-box fingerprinting is preferred for strong fidelity; black-box attacks serve as alternative proxies.
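To make the interpolation idea behind sample-level scores more tangible, the sketch below places the unlearned model's confidence on a line between a shadow OUT model (score 0, sample never seen) and the original model (score 1, sample fully memorized). This linear form is an assumed simplification for illustration and is not the estimator defined in the IAM paper. The function name and arguments are hypothetical.

```python
def interpolated_score(orig_conf, unlearned_conf, out_conf, eps=1e-12):
    """Position of the unlearned model between OUT (0.0) and original (1.0).

    orig_conf:      original model's confidence on the sample.
    unlearned_conf: unlearned model's confidence on the sample.
    out_conf:       shadow OUT model's confidence (trained without the sample).
    Values above 0 suggest residual memorization (under-unlearning);
    values below 0 suggest over-unlearning.
    """
    span = orig_conf - out_conf
    if abs(span) < eps:
        return 0.0  # original and OUT agree: nothing measurable to forget
    return (unlearned_conf - out_conf) / span
```

For example, confidences of 0.9 (original), 0.7 (unlearned), and 0.5 (OUT) give a score of about 0.5, meaning the unlearned model sits halfway between the two reference points.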
6. Limitations and Failure Modes
Inference-time unlearning has several recognized limitations and failure points:
- Architectural dependence: Explicit deletion is only feasible in models with non-parametric or structurally localizable storage (e.g., SPMs, DKVB).
- Reliance on auxiliary artifacts: Contrastive decoding depends on auxiliary models trained on precise retain/forget splits; GUARD requires robust prompt classifiers and semantic extractors.
- Accumulating staleness in deferred retraining: Frameworks such as ERASER must manage the trade-off between latency, retrain frequency, and risk of privacy leakage when pending unlearning accumulates.
- Measurement artifacts: Unlearning “success” measured by membership inference can overstate completeness if the underlying attack model is mismatched; black-box verifiers in iterative schemes may miss subtle residual signals.
- Prompt-level unlearning insufficiency: In diffusion models, prompt engineering or “instruction-based unlearning” fails to suppress target concepts because they remain bound in joint text-image representations; cross-attention dynamics remain unperturbed (Zhang et al., 2 Apr 2026).
7. Practical Recommendations and Future Research Directions
Best practices for deploying inference-time unlearning include:
- Selecting architectures or serving frameworks (e.g., SISA, SPM, DKVB) that enable efficient, explicit removal of example influence when possible.
- Integrating lightweight sample-level auditing metrics (UnleScore, IAM) as continuous, lifecycle-level monitors to detect under- and over-unlearning events and support group fairness analysis.
- Tuning design axes (e.g., retrain thresholds in ERASER, decoding hyperparameters in GUARD/UCD) for the application’s latency vs. forgetfulness requirements.
- For LLMs and generative systems, combining decoding-level suppression (contrastive adjustment, semantic filtering) with iterative verification feedback for the highest assurance.
- Recognizing architectural and modality barriers: effective unlearning in multimodal or deeply entangled diffusion models will require interventions at deeper parameter or representation levels, not merely at the prompt or token-mask level.
Ongoing research themes include formalization of compositional unlearning (concurrent removal of multiple concepts), extensions to federated and continual learning, and tighter theoretical analysis of approximation and guarantee hierarchies across architectures and modalities.