Clean-Label Backdoor Attacks

Updated 3 December 2025
  • Clean-label backdoor attacks are data poisoning methods that inject imperceptible, trigger-bearing samples with truthful labels to evade detection.
  • They rely on optimized trigger design, selective sample targeting, and gradient alignment to induce adversarial behavior with high success rates at low poison rates.
  • Current defenses struggle to detect these attacks due to their label-consistent nature, highlighting the need for robust detection methods that maintain clean accuracy.

Clean-label backdoor attacks represent a class of data poisoning threats against machine learning models in which the attacker injects carefully designed poisoned examples into the training set, but crucially, does not modify their ground-truth labels. By ensuring data-label consistency, these attacks evade easy detection by human annotators and label-sanitization defenses, while implanting model behaviors that can be maliciously activated at inference time through specific input perturbations (“triggers”). Clean-label backdoors have been demonstrated across vision, NLP, speech, graph, and video domains, exhibiting high effectiveness and resilience to many existing defense mechanisms. Research has produced a range of algorithms exploiting sample selection, feature-space manipulation, proxy-model guidance, and multi-trigger compositionality, revealing substantial challenges for practical model trustworthiness.

1. Problem Formulation and Threat Models

A clean-label backdoor attack targets the training pipeline of a machine learning model by introducing a small set of modified (poisoned) samples. Each poisoned sample remains labeled with its true class, distinguishing clean-label attacks from traditional (dirty-label) backdoors, which assign labels inconsistent with the input content. The attacker’s goal is to cause the model—after normal training on this “poisoned” data—to behave as usual on clean inputs, but consistently misclassify any input containing a secret trigger as an attacker-chosen target class.

Formally, given a training set $\mathcal{D}_{\mathrm{train}} = \{(x_i, y_i)\}$, an attacker selects a subset $\mathcal{D}_{\mathrm{poison}} \subset \mathcal{D}_{\mathrm{train}}$ (with $y_i = y_{\mathrm{target}}$), modifies its inputs via a trigger-insertion or feature-space perturbation, and injects these back into the training set, keeping labels unchanged. The resulting poisoned set is

$$\mathcal{D}_{\mathrm{train}}^{*} = \big(\mathcal{D}_{\mathrm{train}} \setminus \mathcal{D}_{\mathrm{poison}}\big) \cup \{(f_g(x),\, y_{\mathrm{target}}) \mid x \in \mathcal{D}_{\mathrm{poison}}\}$$

where $f_g$ is a trigger-embedding function. At inference, any sample $x^{*}$ manipulated by $f_g$ will be misclassified as $y_{\mathrm{target}}$ with high probability. The attack aims to maximize the attack success rate (ASR) on triggered inputs while preserving near-original clean accuracy on benign data (Wu et al., 24 Sep 2025).
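To make the formulation concrete, the following is a minimal sketch of clean-label poison-set construction for image data, assuming a simple fixed corner patch as the trigger-embedding function $f_g$; the helper names (`apply_trigger`, `build_clean_label_poison_set`) and the patch trigger are illustrative choices, not a specific published attack.

```python
import numpy as np

def apply_trigger(x, patch, corner=(0, 0)):
    """Stamp a small pixel patch onto an image: an illustrative f_g."""
    x = x.copy()
    r, c = corner
    ph, pw = patch.shape[:2]
    x[r:r + ph, c:c + pw] = patch
    return x

def build_clean_label_poison_set(X, y, target_class, poison_rate, patch, seed=0):
    """Select a fraction of *target-class* samples, embed the trigger in them,
    and return the modified inputs; labels are never changed."""
    rng = np.random.default_rng(seed)
    target_idx = np.flatnonzero(y == target_class)
    n_poison = min(int(poison_rate * len(X)), len(target_idx))
    chosen = rng.choice(target_idx, size=n_poison, replace=False)
    X_poisoned = X.copy()
    for i in chosen:
        X_poisoned[i] = apply_trigger(X[i], patch)
    return X_poisoned, y, chosen  # y is untouched: the attack is label-consistent
```

Because only target-class samples are perturbed and their labels are left untouched, a human inspecting the poisoned set sees correctly labeled (if subtly altered) examples.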

Threat models vary in required attacker knowledge:

  • Full data access: Original methods often presume access to the entire training set for sample selection.
  • Target-class only: More challenging settings, such as “target-only” or “data-free” scenarios, require only a few (or no) examples from the target class (Zeng et al., 2022, Ma et al., 2023).
  • Black-box vs. white-box: Some attacks assume white-box access to a surrogate or proxy for loss/feature computation; others work in a black-box or cross-architecture transfer regime (Nguyen et al., 15 Jul 2024, Dai et al., 19 Mar 2025).

2. Core Algorithmic Principles

Clean-label backdoor attacks are unified by their manipulation of feature associations rather than label mappings. The main algorithmic elements include:

a. Trigger Design and Embedding

b. Selective Poison Sample Targeting

c. Trigger-Label Dependency Induction

  • Feature/gradient matching: By explicitly aligning the gradients induced by trigger-embedded samples with those that true target-class feature activations would induce, and/or by enforcing alignment in feature space, attacks amplify the model’s reliance on the trigger (Dao et al., 27 Jul 2024); a minimal sketch of this idea follows this list.
  • Proxy-guided optimization: Surrogate models pretrained on auxiliary data can guide both the design of the trigger and the identification of optimal poison targets (Zeng et al., 2022, Yin et al., 29 Apr 2025).
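As a concrete illustration of the gradient-matching idea above, the snippet below sketches a cosine-based gradient-alignment loss in PyTorch. It is a schematic of the general technique rather than the implementation of any cited paper; `model` (a surrogate network), `x_poison` (a perturbable copy of target-class samples), `x_trigger`, and the label tensors are assumed to be supplied by the surrounding attack loop.

```python
import torch
import torch.nn.functional as F

def gradient_alignment_loss(model, x_poison, y_true, x_trigger, y_target):
    """Cosine distance between (a) the parameter gradients induced by the
    clean-label poisons (true labels) and (b) the gradients a triggered batch
    pushed toward the target class would induce. Minimising this w.r.t. the
    poison perturbation makes ordinary training on the poisons mimic training
    on dirty-label triggered data."""
    params = [p for p in model.parameters() if p.requires_grad]

    # "Adversarial" reference gradient: triggered inputs, target labels.
    adv_loss = F.cross_entropy(model(x_trigger), y_target)
    g_adv = torch.autograd.grad(adv_loss, params)

    # Poison gradient: perturbed target-class samples with their TRUE labels;
    # keep the graph so the loss can be backpropagated into the perturbation.
    poison_loss = F.cross_entropy(model(x_poison), y_true)
    g_poison = torch.autograd.grad(poison_loss, params, create_graph=True)

    # Global cosine similarity across all parameter tensors.
    dot = sum((a.detach() * p).sum() for a, p in zip(g_adv, g_poison))
    norm_adv = torch.sqrt(sum((a ** 2).sum() for a in g_adv)).detach()
    norm_poison = torch.sqrt(sum((p ** 2).sum() for p in g_poison))
    return 1.0 - dot / (norm_adv * norm_poison + 1e-12)
```

A typical use is to minimize this loss over a small, norm-bounded perturbation added to `x_poison`, with the surrogate model held fixed or periodically updated.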

d. Multi-target and Physical Triggers

  • Multi-target triggers: Recent advances include frameworks for full-target attacks, in which a unique trigger is crafted for each class, for example via class-conditional autoencoders (Yin et al., 29 Apr 2025); a sketch of such a generator follows this list.
  • Physical-world triggers: Clean-label attacks with physical triggers (e.g., naturalistic accessories in vision, acoustic patterns in speech) have been demonstrated, with matching algorithms ensuring that the poisoned training data does not carry the digital trigger itself but instead has its features perturbed to align with the “trigger distribution” (Dao et al., 27 Jul 2024, Xinyuan et al., 13 Sep 2024).
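One plausible shape for a class-conditional trigger generator is sketched below: an encoder–decoder that injects a learned class embedding and emits a bounded, class-specific perturbation. The layer sizes, embedding mechanism, and `eps` budget are illustrative assumptions, not the architecture of any cited framework.

```python
import torch
import torch.nn as nn

class ClassConditionalTriggerGenerator(nn.Module):
    """Illustrative generator: encode the image, inject a class embedding,
    and decode a bounded perturbation, so each class gets its own trigger."""

    def __init__(self, num_classes, channels=3, embed_dim=32, eps=8 / 255):
        super().__init__()
        self.eps = eps
        self.class_embed = nn.Embedding(num_classes, embed_dim)
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(32 + embed_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x, target_class):
        h = self.encoder(x)                                  # (B, 32, H, W)
        e = self.class_embed(target_class)                   # (B, embed_dim)
        e = e[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))
        delta = torch.tanh(self.decoder(torch.cat([h, e], 1))) * self.eps
        return (x + delta).clamp(0, 1)                       # triggered image
```

In a full-target attack, such a generator would be applied to samples of each class (keeping their labels) during poisoning, and the class-specific trigger would then redirect arbitrary inputs to that class at inference time.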

3. Empirical Results Across Modalities

Clean-label backdoor attacks have been studied in vision, language, audio, video, and graph domains, displaying consistently high attack effectiveness and transferability.

| Domain | Typical Poison Rate | ASR (%) | Stealth / Impact | Notable Approaches |
| --- | --- | --- | --- | --- |
| Image classification | 0.05–1% | 90–99 | Visually imperceptible | Feature-matching, adversarial, autoencoder (Zeng et al., 2022, Luo et al., 2022, Yin et al., 29 Apr 2025, Ma et al., 2023) |
| Video | 0.3–1% | 80–90 | Full spatiotemporal triggers | Universal adversarial triggers (Zhao et al., 2020) |
| Graph | 1–3% | 90–99 | Node features only | Semantic node triggers, feature copy (Dai et al., 19 Mar 2025, Xing et al., 2023) |
| NLP | 1–2% | 90–99 | High human label consistency | Adversarial text, style triggers, paraphrasing (Chen et al., 2022, You et al., 2023, Kirci et al., 21 Aug 2025, Gupta et al., 2023) |
| Speech | 1.5–10% | 99 | Minor accuracy loss | Proxy-based, signal insertion (Xinyuan et al., 13 Sep 2024) |
| Anomaly detection | ~5% | 87–97 | Minor AUC reduction | Latent alignment/concentration (Cheng et al., 17 Dec 2024) |

Notable findings:

  • Attacks using advanced sample selection or feature-guided triggers (e.g., “Narcissus,” “DFB,” class-conditional autoencoders) achieve ASRs above 90% at poison rates below 0.5% (Ma et al., 2023, Zeng et al., 2022, Yin et al., 29 Apr 2025).
  • In NLP, LLM-based paraphrase-style triggers (“LLMBkd”) and adversarial sample mining (Kallima) yield high ASR and human-imperceptibility (You et al., 2023, Chen et al., 2022).
  • Graph backdoors using semantic triggers (e.g., SCLBA) reach ASR ≈ 99% with poisoning rates below 3%, outperforming subgraph baselines (Dai et al., 19 Mar 2025, Xing et al., 2023).
  • Active learning pipelines are vulnerable, as acquisition functions can be exploited to ensure poisoned samples are repeatedly selected—even at poison budgets as low as 0.5% (Zhi et al., 5 Aug 2025).
  • In object detection, label-preserving trigger alignment enables both disappearance and generation attacks with ASR>92% at poison rates of 5–10%, while preserving mAP (Cheng et al., 2023).

4. Model-Specific and Domain-Specific Innovations

Recent research emphasizes tailoring clean-label attacks to the victim’s architecture and intended domain:

  • Vision Transformers (ViT): Loss formulations leveraging attention rollout alignment and diffusion covering the trigger region are highly effective; sub-trigger masking further boosts stealth and robustness against known ViT defenses (Gong et al., 6 Dec 2024).
  • Multi-class/multi-target robustness: FFCBA’s FSBA/FMBA paradigms create differentiated triggers for each class, using class-conditional autoencoders and latent centroids to maximize intra-class consistency and inter-class separability. FMBA demonstrates improved transfer to unseen models (Yin et al., 29 Apr 2025).
  • Image-specific and two-phase triggers: U-Net–based generators are trained to produce perturbations that simultaneously support both implantation (network erasure) and activation (trigger reliance), with LPIPS regularization for imperceptibility (Luo et al., 2022).
  • Trigger intensity/channel design: Exploiting the human visual system’s lower sensitivity to blue allows the trigger signal to be amplified in the blue channel, maximizing ASR under a fixed visibility threshold (Wu et al., 24 Sep 2025); a simple sketch of this weighting follows this list.
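The channel-weighting idea in the last bullet can be sketched as follows, assuming an RGB trigger perturbation in height × width × channel layout; the specific weights are illustrative, not those used in the cited work.

```python
import numpy as np

def channel_weighted_trigger(delta, weights=(0.5, 0.75, 1.5)):
    """Reweight an RGB trigger perturbation so more of its energy sits in the
    blue channel (where small changes are less visible), then renormalize so
    the overall L2 perturbation budget is unchanged."""
    delta = np.asarray(delta, dtype=np.float32)          # shape (H, W, 3), RGB
    w = np.asarray(weights, dtype=np.float32)            # per-channel weights
    weighted = delta * w                                 # emphasize blue
    scale = np.linalg.norm(delta) / (np.linalg.norm(weighted) + 1e-12)
    return weighted * scale
```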

5. Defenses and Their Limitations

Clean-label backdoors are designed to evade most classical data- or label-sanitization approaches, necessitating specialized defenses:

  • Outlier detection methods (activation clustering, spectral signatures, DeepKNN) are largely circumvented, since synthesizing the trigger to align with target-class features obscures spectral or feature anomalies (Dao et al., 27 Jul 2024, Ma et al., 2023); a minimal sketch of the spectral-signature recipe such attacks evade appears after this list.
  • Fine-Pruning and pruning-based approaches can suppress backdoors only at the cost of significant clean accuracy drop, since the trigger-correlated neurons overlap with essential features (Zeng et al., 2022, Yin et al., 29 Apr 2025).
  • Certified and ensemble defenses in NLP, such as Deep Partition Aggregation (DPA/S-DPA), can reduce ASR to roughly 13%, but they incur high computational cost and may hurt accuracy, especially on harder tasks (Gupta et al., 2023).
  • Paraphrasing and input transformations reduce efficacy of text triggers but are weak against semantically reinforced or style-based triggers (Chen et al., 2022, You et al., 2023).
  • Filtering/denoising is somewhat effective for adversarial noise components in speech, but fails for piggybacked physical or semantic triggers (Xinyuan et al., 13 Sep 2024).
  • Feature- and attention-map monitoring offer possible future defense routes, especially for architectures with high interpretability (ViT), but have yet to demonstrate detection of advanced low-budget clean-label attacks (Gong et al., 6 Dec 2024).
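For reference, the following is a minimal sketch of the standard spectral-signature filtering recipe mentioned above: the feature matrix of a (putative target) class is centered, each sample is scored by its squared projection onto the top singular direction, and the highest-scoring samples are removed. The feature extractor and removal fraction are left to the caller.

```python
import numpy as np

def spectral_signature_scores(features):
    """Score each sample by its squared projection onto the top singular
    direction of the centered per-class feature matrix."""
    feats = np.asarray(features, dtype=np.float64)        # (n_samples, dim)
    centered = feats - feats.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2                        # high score = suspect

def keep_after_spectral_filter(features, remove_frac=0.05):
    """Return indices of the samples retained after dropping the
    remove_frac highest-scoring (most suspicious) samples."""
    scores = spectral_signature_scores(features)
    n_keep = int(len(scores) * (1 - remove_frac))
    return np.argsort(scores)[:n_keep]
```

Clean-label attacks that shape poisoned features to match the target-class distribution keep these scores close to those of benign samples, which is why the filter removes little of the poison.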

A common theoretical theme is that, since the trigger is made a true statistical feature of the target class, defenses aimed at removing it will degrade model accuracy on genuine examples, creating an inherent trade-off (Zeng et al., 2022, Yu et al., 2 Jun 2024, Ma et al., 2023).

6. Theoretical Foundations and Generalization

Advances in theory have formalized generalization error bounds for clean-label backdoor attacks, showing that both clean and backdoor test errors can be controlled by the empirical risk on the poisoned set, up to capacity-dependent terms (Yu et al., 2 Jun 2024). Trigger design constraints (feature persistence, statistical indistinguishability) can be precisely linked both to the attacker’s efficacy and to the defender’s inability to filter backdoors without harming normal performance.
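Schematically, such bounds take the following shape, where $\hat{f}$ is the model trained on $\mathcal{D}_{\mathrm{train}}^{*}$, $\widehat{R}$ its empirical risk on that poisoned set, and $\mathcal{C}(\mathcal{F}, n)$ a term depending on the hypothesis class and sample size; this is an illustrative restatement of the form described above, not the exact statement of the cited result:

$$\max\Big\{\Pr_{(x,y)}\big[\hat{f}(x)\neq y\big],\ \Pr_{(x,y)}\big[\hat{f}(f_g(x))\neq y_{\mathrm{target}}\big]\Big\} \;\lesssim\; \widehat{R}\big(\hat{f};\,\mathcal{D}_{\mathrm{train}}^{*}\big) + \mathcal{C}(\mathcal{F}, n).$$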

A key insight is the “inward-pointing trigger” lemma: the optimal trigger direction is proportional to the vector from the sample to the class centroid—hence, erasing triggers also damages clean-class discrimination (Zeng et al., 2022).
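In symbols, and writing $\mu_{\mathrm{target}}$ for the target-class centroid in the relevant representation space, the lemma can be restated schematically as

$$\delta^{*}(x) \;\propto\; \mu_{\mathrm{target}} - x,$$

so any filtering operation that projects out the trigger component also removes part of the direction that separates the target class from the others, degrading clean discrimination.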

7. Open Problems and Future Directions

While clean-label backdoor attacks are now established as viable and stealthy threats, several open challenges persist:

  • Provable defense design: Developing detection and purification methods that do not undermine clean accuracy, potentially via feature or manifold regularization.
  • Compositional, physical, and multi-modal triggers: Extending attacks—and defenses—to handle triggers that survive real-world augmentations, compositional attacks (multiple triggers/targets), and cross-modal associations.
  • Attack stability and transferability: Reducing variability in ASR due to stochastic training, and ensuring that proxy-optimized triggers generalize across architectures and data regimes.
  • Semantic and interpretability-based analysis: Deploying interpretable models or attribution techniques to identify trigger-induced decision boundaries or anomalous dependence on rare features.

In sum, clean-label backdoor attacks have demonstrated potency across data modalities, learning paradigms, and threat models, representing an active frontier for both machine learning security and trustworthy artificial intelligence (Dao et al., 27 Jul 2024, Zeng et al., 2022, You et al., 2023, Yin et al., 29 Apr 2025, Wu et al., 24 Sep 2025, Ma et al., 2023).
