
Backdoor Attack via Machine Unlearning

Updated 20 February 2026
  • Backdoor attacks via machine unlearning are adversarial exploits that embed hidden triggers during the unlearning phase, activating only after specific data removal.
  • The methodology involves poisoning training data and orchestrating selective unlearning operations, which dramatically increase the post-unlearning attack success rate.
  • Empirical results show that unlearning operations can shift ASR from 20–30% pre-unlearning to over 85–95% post-unlearning, highlighting a critical vulnerability.

A backdoor attack via machine unlearning is an adversarial exploitation of data deletion mechanisms in machine learning systems, in which an attacker orchestrates both the training phase and subsequent unlearning requests to implant or amplify a hidden backdoor. This attack paradigm diverges from conventional backdoor strategies by leveraging the post-hoc data removal process (machine unlearning), resulting in attacks that are highly stealthy—often evading pre-deployment detection—yet become highly effective only after the target model is altered by subsequent unlearning operations. Multiple distinctive technical constructs, threat models, and workflow realizations have been developed, forming a spectrum from "clean unlearning" backdoors to revocable and deferred-activation variants, across supervised, multimodal, and federated settings (Arazzi et al., 14 Jun 2025, Song et al., 15 Oct 2025, Alam et al., 17 Feb 2025).

1. Threat Models and Formal Definitions

Backdoor attacks via machine unlearning are typically constructed under the following adversarial framework:

  • Attack Goals: Implant a backdoor (e.g., malicious input-to-label association) that is stealthy during normal model evaluation but becomes highly active solely after a model has undergone a specific unlearning operation.
  • Attacker Capabilities: The attacker can (a) poison a subset of training samples pre-deployment, (b) submit unlearning requests—often for "clean" or camouflage samples via standard APIs, and (c) in some paradigms, select, design, or invert triggers and forget-sets, occasionally with shadow/surrogate model access.
  • Unlearning Mechanisms: The victim model $M$ is trained on $D_{train} = CD \cup PD$, where $CD$ is the clean dataset and $PD$ is the set of the attacker's poisoned samples. Subsequently, the attacker requests unlearning of $FD \subset CD$, yielding $M' = U(M, FD)$ for an unlearning operator $U(\cdot)$.
  • Attack Activation: Backdoor efficacy, such as the attack success rate (ASR) for targeted triggers, is low before unlearning but surges after forgetting is applied, with negligible loss in clean accuracy (Arazzi et al., 14 Jun 2025, Alam et al., 17 Feb 2025, Zhang et al., 2023).

Formally, for post-training model parameters $\theta^*$, the unlearned parameters $\theta'$ result from solving:

$$\theta' = \arg\min_{\theta} L_{retain}(\theta; RD) + Reg(\theta; \theta^*, FD)$$

with $RD = CD \setminus FD$ (Arazzi et al., 14 Jun 2025). Attack success is quantified by the post-unlearning proportion of trigger-poisoned inputs mapped to the attacker's target label.
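The ASR that quantifies attack success follows directly from this definition. A minimal sketch (the model, trigger-stamping function, and data below are hypothetical stand-ins, not any paper's actual pipeline):

```python
def attack_success_rate(model, inputs, add_trigger, target_label):
    """Fraction of trigger-stamped inputs classified as the attacker's target.

    `model` is any callable returning a predicted label; `add_trigger`
    stamps the backdoor pattern onto an input. Both are placeholders
    for the victim's real pipeline.
    """
    hits = sum(1 for x in inputs if model(add_trigger(x)) == target_label)
    return hits / len(inputs)

# Toy illustration: a "model" that outputs the target label only when the
# trigger flag is present, mimicking post-unlearning behaviour.
TARGET = 7
toy_model = lambda x: TARGET if x.get("trigger") else x["label"]
stamp = lambda x: {**x, "trigger": True}
clean = [{"label": i % 10} for i in range(100)]
print(attack_success_rate(toy_model, clean, stamp, TARGET))  # 1.0
```

The same function evaluated with the identity in place of `stamp` gives the clean-input baseline, which is how the pre- vs. post-unlearning ASR gap is reported.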

2. Attack Methodologies

Several families of attack workflows have been articulated:

2.1 Clean Unlearning Backdoor Attack

The canonical two-phase workflow (Arazzi et al., 14 Jun 2025):

  • Phase 1: Weak, Distributed Poisoning
    • Inject a weak, distributed trigger signal (e.g., $\delta$ in the mid-frequency DCT domain) into a small fraction of samples from multiple classes.
    • Poisoned samples are chosen for low similarity (measured via a shadow model) to the target-class embedding, avoiding detection signatures.
    • The model learns a weak $\delta \to L_t$ association entangled with benign features.
  • Phase 2: Selective Unlearning
    • The attacker requests forgetting of the clean, unpoisoned variants of previously poisoned samples.
    • Unlearning erases the clean mapping for these examples, unmasking the $\delta \to L_t$ association and dramatically raising post-unlearning ASR.
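Phase 1's frequency-domain trigger can be sketched in a few lines. This is an illustrative stand-in, not the paper's construction: it embeds a fixed weak pattern in a mid-frequency ring using a 2D FFT rather than the DCT, and the band and strength parameters are assumptions:

```python
import numpy as np

def add_midfreq_trigger(img, delta=2.0, band=(8, 16), seed=0):
    """Embed a weak, spatially distributed trigger into a mid-frequency
    band of a grayscale image. `band` selects the (low, high) frequency
    ring and `delta` scales the perturbation strength; a fixed seed makes
    the same pattern reappear across all poisoned samples.
    """
    rng = np.random.default_rng(seed)
    spec = np.fft.fft2(img.astype(float))
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None] * h
    fx = np.fft.fftfreq(w)[None, :] * w
    radius = np.hypot(fy, fx)                      # distance from DC component
    mask = (radius >= band[0]) & (radius < band[1])
    spec[mask] += delta * rng.standard_normal(mask.sum())
    return np.real(np.fft.ifft2(spec))

img = np.zeros((32, 32))
poisoned = add_midfreq_trigger(img)
print(np.abs(poisoned - img).mean())  # small, spatially spread perturbation
```

Because the perturbation lives in a frequency band shared with legitimate image structure, it produces no localized patch for trigger-inversion defenses to latch onto.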

2.2 Revocable Backdoor Attacks

A revocable backdoor is explicitly engineered to both activate and later erase via machine unlearning (Song et al., 15 Oct 2025):

  • Bilevel Trigger Optimization: The trigger generator is trained via a bilevel problem that simulates both injection and unlearning, with explicit regularization to facilitate erasure.
  • Deterministic Partitioning and PCGrad: Poisoned and unlearning subsets are fixed, and Projected Conflicting Gradient (PCGrad) resolves gradient conflicts between attack and unlearning objectives.
  • Empirical Results: On CIFAR-10, pre-unlearning ASR >98%, with post-unlearning ASR reduced to <13% using Unroll-SGD unlearning.
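The PCGrad step can be illustrated for the two-task case: when the attack-injection and unlearning gradients point in conflicting directions, each is projected off the other before the update. A minimal numpy sketch of the gradient surgery (a simplification, not the paper's exact training loop):

```python
import numpy as np

def pcgrad_pair(g_attack, g_unlearn):
    """Projected Conflicting Gradients for two tasks.

    If the dot product of the two gradients is negative they conflict; the
    conflicting component of each is stripped before summing, so the
    combined update no longer fights either objective.
    """
    g1 = g_attack.astype(float).copy()
    g2 = g_unlearn.astype(float).copy()
    if g1 @ g2 < 0:
        g1 -= (g1 @ g2) / (g2 @ g2) * g2                       # project g1 off g2
        g2 -= (g2 @ g_attack) / (g_attack @ g_attack) * g_attack  # project g2 off g1
    return g1 + g2

# Conflicting toy gradients: the projected update is orthogonal to neither
# objective's descent direction in the harmful sense.
print(pcgrad_pair(np.array([1.0, 0.0]), np.array([-1.0, 1.0])))  # [0.5 1.5]
```

When the gradients already agree (non-negative dot product), the function reduces to their plain sum.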

2.3 Camouflage-Aware Concealed Backdoor

The ReVeil paradigm (Alam et al., 17 Feb 2025) manipulates the data collection pipeline:

  • Camouflage samples (with true labels) are interleaved with regular poison; both carry the trigger.
  • Pre-deployment: The camouflage samples counteract the backdoor association, suppressing ASR.
  • Post-unlearning: Authorized removal of camouflage samples restores the attack, with ASR after unlearning ≈ 98–100% and clean accuracy within 1–2% of baseline.
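The data-collection manipulation reduces to splitting trigger-carrying samples by label treatment: poison samples get the target label, camouflage samples keep their true labels. A hypothetical sketch (function names and the camouflage ratio are illustrative, not ReVeil's exact settings):

```python
def build_camouflaged_sets(samples, add_trigger, target_label, camo_ratio=0.5):
    """Split trigger-carrying samples into a poison subset (label flipped to
    the target) and a camouflage subset (true label kept). Unlearning the
    camouflage subset later removes the counteracting association and
    restores the backdoor.
    """
    n_camo = int(len(samples) * camo_ratio)
    camouflage = [(add_trigger(x), y) for x, y in samples[:n_camo]]          # true labels
    poison = [(add_trigger(x), target_label) for x, y in samples[n_camo:]]   # flipped labels
    return poison, camouflage

# Toy data: inputs are lists, labels cycle over three classes.
data = [([i], i % 3) for i in range(10)]
poison, camo = build_camouflaged_sets(data, lambda x: x + ["T"], target_label=9)
print(len(poison), len(camo))  # 5 5
```

Both subsets carry the trigger, so any pre-deployment trigger scan sees a mix of consistent and flipped labels, which masks the backdoor association.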

2.4 Unlearning-Triggered Backdoor in Federated/Black-box Settings

Other variants include:

  • Federated Learning Attacks: Malicious clients mask the backdoor during FL training with camouflage samples, then post-deployment unlearning requests activate the backdoor (Lu et al., 21 Aug 2025).
  • Black-Box Machine Unlearning Attacks: Iterative forgetting of mitigation samples (with true labels) causes initially dormant backdoors to emerge (Zhang et al., 2023).

3. Empirical Characterization and Metrics

Comprehensive experiments have been conducted across model families (ResNets, VGG, Vision Transformers, EfficientNets), datasets (CIFAR-10/100, MNIST, FashionMNIST, ImageNet, GTSRB, STL-10), and unlearning mechanisms (Fisher Forgetting, Boundary Unlearning, SISA, fine-tuning, SGD unroll, etc.). Core metrics include:

  • Clean Accuracy (Acc_retain/BA): Maintained to within a few percentage points of a clean model post-unlearning.
  • Forget-set Accuracy (Acc_forget): Drops from baseline to ≈0% post-unlearning, evidencing effective erasure of clean mapping.
  • Attack Success Rate (ASR): Low (≈20–30%) pre-unlearning, jumping to >85–95% post-unlearning under the attack paradigms described above (Arazzi et al., 14 Jun 2025, Song et al., 15 Oct 2025, Alam et al., 17 Feb 2025).
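These three metrics can be computed with a few lines given any label-returning model. A toy sketch (the model and data splits below are synthetic stand-ins mimicking a post-unlearning victim):

```python
def eval_unlearning(model, retain, forget, triggered, target_label):
    """Compute the three standard metrics for an unlearning-triggered
    backdoor: clean accuracy on retained data, accuracy on the forget set,
    and ASR on trigger-stamped inputs. `model` is any label-returning
    callable; `retain` and `forget` are (input, label) pairs.
    """
    acc = lambda pairs: sum(model(x) == y for x, y in pairs) / len(pairs)
    asr = sum(model(x) == target_label for x, _ in triggered) / len(triggered)
    return {"Acc_retain": acc(retain), "Acc_forget": acc(forget), "ASR": asr}

# Toy post-unlearning model: correct on retained points, wrong on the
# forget set, and maps any trigger-stamped input (a tuple) to the target.
retain = [(i, i % 2) for i in range(10)]
forget = [(100 + i, 0) for i in range(10)]
trig = [(("T", i), None) for i in range(10)]
model = lambda x: 5 if isinstance(x, tuple) else (x % 2 if x < 100 else 1)
print(eval_unlearning(model, retain, forget, trig, target_label=5))
# {'Acc_retain': 1.0, 'Acc_forget': 0.0, 'ASR': 1.0}
```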

Representative results for the clean unlearning attack on CIFAR-10 with Boundary Unlearning (Arazzi et al., 14 Jun 2025):

| Stage | Acc_retain (%) | Acc_forget (%) | ASR (%) |
|---|---|---|---|
| Pre-unlearning | 70.8 | 78.8 | 28.3 |
| Post-unlearning | 62.2 | 17.6 | 88.5 |

4. Failure Modes of Existing Defenses

Conventional backdoor detection and mitigation strategies have proven ineffective against machine-unlearning-triggered backdoors:

  • Neural Cleanse: Assumes a single, conspicuous trigger, but distributed weak triggers are undetectable due to their spread across multiple classes and conflicting label associations.
  • Cognitive Distillation (CD): Fails to remove mid-frequency triggers that overlap with legitimate features.
  • I-BAU (implicit hypergradient backdoor adversarial unlearning): Yields partial ASR reduction, but post-unlearning success rates remain unacceptably high (often >50–70%).
  • STRIP, GradCAM, Fine-Pruning: Unable to distinguish or eradicate clean unlearning backdoors; entropy and saliency analyses do not localize the distributed trigger signal (Arazzi et al., 14 Jun 2025, Song et al., 15 Oct 2025).

5. Open Questions and Defensive Directions

Current research identifies the urgent need for unlearning-aware defense strategies, with the following proposals and challenges:

  • Defensive Randomization: Introduce stochasticity or randomization into unlearning procedures to disrupt persistent backdoor associations in latent space.
  • Feature-Space Monitoring: Audit internal activations or shuffle representation features during unlearning to identify and sever latent $\delta \to L_t$ pathways.
  • Latent Feature Erosion Detection: Deploy anomaly detectors capable of flagging abrupt feature-space changes (abrupt drop in forget accuracy with persistent backdoor sensitivity).
  • Provable Security: It remains an open technical problem to formally guarantee that unlearning a set $FD$ irreversibly removes both direct and indirect (latent) influence paths, including backdoors, from the retained model (Arazzi et al., 14 Jun 2025).
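The latent feature erosion idea above can be sketched as a simple heuristic monitor: alert when the forget-set accuracy collapses while sensitivity to a candidate trigger rises. The thresholds below are illustrative assumptions, not values from the cited works:

```python
def flag_suspicious_unlearning(metrics_before, metrics_after,
                               forget_drop=0.5, asr_rise=0.3):
    """Raise an alert when an unlearning operation shows the signature of
    an unlearning-activated backdoor: forget accuracy drops sharply while
    trigger sensitivity (ASR against a candidate trigger) surges.
    """
    drop = metrics_before["Acc_forget"] - metrics_after["Acc_forget"]
    rise = metrics_after["ASR"] - metrics_before["ASR"]
    return drop >= forget_drop and rise >= asr_rise

# Numbers echo the CIFAR-10 Boundary Unlearning table above.
before = {"Acc_forget": 0.79, "ASR": 0.28}
after = {"Acc_forget": 0.18, "ASR": 0.89}
print(flag_suspicious_unlearning(before, after))  # True
```

Such a detector presupposes access to a candidate trigger or a trigger-inversion routine; making it work without that knowledge is part of the open problem.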

6. Broader Impact and Taxonomy

Backdoor attacks via machine unlearning reveal a novel and potent threat landscape, especially as the "right to be forgotten" and unlearning APIs become widespread. The attack surface is broad, crossing supervised classification, multimodal contrastive learning, federated learning, and even LLMs and diffusion models (Arazzi et al., 14 Jun 2025, Alam et al., 17 Feb 2025, Shang et al., 19 Oct 2025, Lu et al., 21 Aug 2025, Grebe et al., 29 Apr 2025). Attackers can conceal, defer, revoke, or revive backdoors by exploiting the specifics of the unlearning implementation and API. Stealthiness is maximized through strategies such as:

  • Clean, camouflage, or weak-distributed poisoning
  • Non-triggered, “unsuspicious” forget sets
  • Deferred or revocable triggering dependent on model lifecycle events or user-driven unlearning

These techniques often leave the model appearing benign to all standard tests until unlearning is invoked, at which point broad misclassification or malicious functionality is triggered with minimal accuracy degradation on clean inputs.

7. Representative Algorithms and Pseudocode

An archetypal attack pipeline for clean unlearning-triggered backdoors (Arazzi et al., 14 Jun 2025):

```python
# Phase 1: weak, distributed poisoning
poison_set, clean_versions = [], []
for x in subset_of_clean_training:
    if is_poison_candidate(x):
        poison_set.append(add_trigger(x, delta))   # stamp the weak trigger
        clean_versions.append(x)                   # keep the clean counterpart

model = train_model(clean_set + poison_set)

# Phase 2: request unlearning of the clean counterparts of the poisoned samples
M_prime = unlearn(model, forget_set=clean_versions)
```

Key subroutines include DCT-based trigger construction, similarity-driven candidate selection, and unlearning via explicitly supported API (e.g., Boundary Unlearning, Fisher Forgetting, SISA).


Backdoor attacks via machine unlearning expose a critical and subtle vulnerability in modern machine learning systems, undermining trust in data-deletion guarantees and highlighting outstanding challenges in the design of provably robust, backdoor-resilient unlearning mechanisms (Arazzi et al., 14 Jun 2025, Song et al., 15 Oct 2025, Alam et al., 17 Feb 2025, Zhang et al., 2023).
