Clean-Label Backdoor Attacks

Updated 3 December 2025
  • Clean-label backdoor attacks are data poisoning methods that inject imperceptible, trigger-bearing samples with truthful labels to evade detection.
  • They rely on optimized trigger design, selective sample targeting, and gradient alignment to induce adversarial behavior with high success rates at low poison rates.
  • Current defenses struggle to detect these attacks due to their label-consistent nature, highlighting the need for robust detection methods that maintain clean accuracy.

Clean-label backdoor attacks represent a class of data poisoning threats against machine learning models in which the attacker injects carefully designed poisoned examples into the training set, but crucially, does not modify their ground-truth labels. By ensuring data-label consistency, these attacks evade easy detection by human annotators and label-sanitization defenses, while implanting model behaviors that can be maliciously activated at inference time through specific input perturbations (“triggers”). Clean-label backdoors have been demonstrated across vision, NLP, speech, graph, and video domains, exhibiting high effectiveness and resilience to many existing defense mechanisms. Research has produced a range of algorithms exploiting sample selection, feature-space manipulation, proxy-model guidance, and multi-trigger compositionality, revealing substantial challenges for practical model trustworthiness.

1. Problem Formulation and Threat Models

A clean-label backdoor attack targets the training pipeline of a machine learning model by introducing a small set of modified (poisoned) samples. Each poisoned sample remains labeled with its true class, distinguishing clean-label attacks from traditional (dirty-label) backdoors, which assign labels inconsistent with the input content. The attacker’s goal is to cause the model—after normal training on this “poisoned” data—to behave as usual on clean inputs, but consistently misclassify any input containing a secret trigger as an attacker-chosen target class.

Formally, given a training set $\mathcal{D}_{\mathrm{train}} = \{(x_i, y_i)\}$, an attacker selects a subset $\mathcal{D}_{\mathrm{poison}} \subset \mathcal{D}_{\mathrm{train}}$ (with $y_i = y_{\mathrm{target}}$), modifies its inputs via a trigger-insertion or feature-space perturbation, and injects these back into the training set, keeping labels unchanged. The resulting poisoned set is

$$\mathcal{D}_{\mathrm{train}}^{*} = \big(\mathcal{D}_{\mathrm{train}} \setminus \mathcal{D}_{\mathrm{poison}}\big) \cup \{(f_g(x),\, y_{\mathrm{target}}) \mid x \in \mathcal{D}_{\mathrm{poison}}\}$$

where $f_g$ is a trigger-embedding function. At inference, any sample $x^{*}$ manipulated by $f_g$ will be misclassified as $y_{\mathrm{target}}$ with high probability. The attack aims to maximize the attack success rate (ASR) on triggered inputs while preserving near-original clean accuracy on benign data (Wu et al., 24 Sep 2025).
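To make the formulation concrete, the following is a minimal sketch of clean-label poison-set construction for image data, assuming a simple fixed corner patch as the trigger-embedding function $f_g$; the helper names (`apply_trigger`, `build_clean_label_poison_set`) and the patch trigger are illustrative choices, not a specific published attack.

```python
import numpy as np

def apply_trigger(x, patch, corner=(0, 0)):
    """Stamp a small pixel patch onto an image: an illustrative f_g."""
    x = x.copy()
    r, c = corner
    ph, pw = patch.shape[:2]
    x[r:r + ph, c:c + pw] = patch
    return x

def build_clean_label_poison_set(X, y, target_class, poison_rate, patch, seed=0):
    """Select a fraction of *target-class* samples, embed the trigger in them,
    and return the modified inputs; labels are never changed."""
    rng = np.random.default_rng(seed)
    target_idx = np.flatnonzero(y == target_class)
    n_poison = min(int(poison_rate * len(X)), len(target_idx))
    chosen = rng.choice(target_idx, size=n_poison, replace=False)
    X_poisoned = X.copy()
    for i in chosen:
        X_poisoned[i] = apply_trigger(X[i], patch)
    return X_poisoned, y, chosen  # y is untouched: the attack is label-consistent
```

Because only target-class samples are perturbed and their labels are left untouched, a human inspecting the poisoned set sees correctly labeled (if subtly altered) examples.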

Threat models vary in required attacker knowledge:

  • Full data access: Original methods often presume access to the entire training set for sample selection.
  • Target-class only: More challenging settings, such as “target-only” or “data-free” scenarios, require only a few (or no) examples from the target class (Zeng et al., 2022, Ma et al., 2023).
  • Black-box vs. white-box: Some attacks assume white-box access to a surrogate or proxy for loss/feature computation; others work in a black-box or cross-architecture transfer regime (Nguyen et al., 15 Jul 2024, Dai et al., 19 Mar 2025).

2. Core Algorithmic Principles

Clean-label backdoor attacks are unified by their manipulation of feature associations rather than label mappings. The main algorithmic elements include:

a. Trigger Design and Embedding

b. Selective Poison Sample Targeting

c. Trigger-Label Dependency Induction

  • Feature/gradient matching: By explicitly aligning the gradients induced by trigger-embedded samples with those that true target-class feature activations would induce, and/or by enforcing alignment in feature space, attacks amplify the model’s reliance on the trigger (Dao et al., 27 Jul 2024); a minimal sketch of this idea follows this list.
  • Proxy-guided optimization: Surrogate models pretrained on auxiliary data can guide both the design of the trigger and the identification of optimal poison targets (Zeng et al., 2022, Yin et al., 29 Apr 2025).
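As a concrete illustration of the gradient-matching idea above, the snippet below sketches a cosine-based gradient-alignment loss in PyTorch. It is a schematic of the general technique rather than the implementation of any cited paper; `model` (a surrogate network), `x_poison` (a perturbable copy of target-class samples), `x_trigger`, and the label tensors are assumed to be supplied by the surrounding attack loop.

```python
import torch
import torch.nn.functional as F

def gradient_alignment_loss(model, x_poison, y_true, x_trigger, y_target):
    """Cosine distance between (a) the parameter gradients induced by the
    clean-label poisons (true labels) and (b) the gradients a triggered batch
    pushed toward the target class would induce. Minimising this w.r.t. the
    poison perturbation makes ordinary training on the poisons mimic training
    on dirty-label triggered data."""
    params = [p for p in model.parameters() if p.requires_grad]

    # "Adversarial" reference gradient: triggered inputs, target labels.
    adv_loss = F.cross_entropy(model(x_trigger), y_target)
    g_adv = torch.autograd.grad(adv_loss, params)

    # Poison gradient: perturbed target-class samples with their TRUE labels;
    # keep the graph so the loss can be backpropagated into the perturbation.
    poison_loss = F.cross_entropy(model(x_poison), y_true)
    g_poison = torch.autograd.grad(poison_loss, params, create_graph=True)

    # Global cosine similarity across all parameter tensors.
    dot = sum((a.detach() * p).sum() for a, p in zip(g_adv, g_poison))
    norm_adv = torch.sqrt(sum((a ** 2).sum() for a in g_adv)).detach()
    norm_poison = torch.sqrt(sum((p ** 2).sum() for p in g_poison))
    return 1.0 - dot / (norm_adv * norm_poison + 1e-12)
```

A typical use is to minimize this loss over a small, norm-bounded perturbation added to `x_poison`, with the surrogate model held fixed or periodically updated.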

d. Multi-target and Physical Triggers

  • Multi-target triggers: Recent advances include frameworks for full-target attacks, in which a unique trigger is crafted for each class, for example via class-conditional autoencoders (Yin et al., 29 Apr 2025); a sketch of such a generator follows this list.
  • Physical-world triggers: Clean-label attacks with physical triggers (e.g., naturalistic accessories in vision, acoustic patterns in speech) have been demonstrated, with matching algorithms ensuring that the poisoned training data does not carry the digital trigger itself but instead has its features perturbed to align with the “trigger distribution” (Dao et al., 27 Jul 2024, Xinyuan et al., 13 Sep 2024).
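One plausible shape for a class-conditional trigger generator is sketched below: an encoder–decoder that injects a learned class embedding and emits a bounded, class-specific perturbation. The layer sizes, embedding mechanism, and `eps` budget are illustrative assumptions, not the architecture of any cited framework.

```python
import torch
import torch.nn as nn

class ClassConditionalTriggerGenerator(nn.Module):
    """Illustrative generator: encode the image, inject a class embedding,
    and decode a bounded perturbation, so each class gets its own trigger."""

    def __init__(self, num_classes, channels=3, embed_dim=32, eps=8 / 255):
        super().__init__()
        self.eps = eps
        self.class_embed = nn.Embedding(num_classes, embed_dim)
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(32 + embed_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x, target_class):
        h = self.encoder(x)                                  # (B, 32, H, W)
        e = self.class_embed(target_class)                   # (B, embed_dim)
        e = e[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))
        delta = torch.tanh(self.decoder(torch.cat([h, e], 1))) * self.eps
        return (x + delta).clamp(0, 1)                       # triggered image
```

In a full-target attack, such a generator would be applied to samples of each class (keeping their labels) during poisoning, and the class-specific trigger would then redirect arbitrary inputs to that class at inference time.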

3. Empirical Results Across Modalities

Clean-label backdoor attacks have been studied in vision, language, audio, video, and graph domains, displaying consistently high attack effectiveness and transferability.

| Domain | Typical Poison Rate | ASR (%) | Stealth / Impact | Notable Approaches |
| --- | --- | --- | --- | --- |
| Image classification | 0.05–1% | 90–99 | Visually imperceptible | Feature-matching, adversarial, autoencoder (Zeng et al., 2022, Luo et al., 2022, Yin et al., 29 Apr 2025, Ma et al., 2023) |
| Video | 0.3–1% | 80–90 | Full spatiotemporal triggers | Universal adversarial triggers (Zhao et al., 2020) |
| Graph | 1–3% | 90–99 | Node features only | Semantic node triggers, feature copy (Dai et al., 19 Mar 2025, Xing et al., 2023) |
| NLP | 1–2% | 90–99 | High human label consistency | Adversarial text, style triggers, paraphrasing (Chen et al., 2022, You et al., 2023, Kirci et al., 21 Aug 2025, Gupta et al., 2023) |
| Speech | 1.5–10% | 99 | Minor accuracy loss | Proxy-based, signal insertion (Xinyuan et al., 13 Sep 2024) |
| Anomaly detection | ~5% | 87–97 | Minor AUC reduction | Latent alignment/concentration (Cheng et al., 17 Dec 2024) |

Notable findings:

  • Attacks using advanced sample selection or feature-guided triggers (e.g., “Narcissus,” “DFB,” class-conditional autoencoders) achieve ASRs above 90% at poison rates below 0.5% (Ma et al., 2023, Zeng et al., 2022, Yin et al., 29 Apr 2025).
  • In NLP, LLM-based paraphrase-style triggers (“LLMBkd”) and adversarial sample mining (Kallima) yield high ASR and human-imperceptibility (You et al., 2023, Chen et al., 2022).
  • Graph backdoors using semantic triggers (e.g., SCLBA) reach ASR ≈ 99% with poisoning rates below 3%, outperforming subgraph baselines (Dai et al., 19 Mar 2025, Xing et al., 2023).
  • Active learning pipelines are vulnerable, as acquisition functions can be exploited to ensure poisoned samples are repeatedly selected—even at poison budgets as low as 0.5% (Zhi et al., 5 Aug 2025).
  • In object detection, label-preserving trigger alignment enables both disappearance and generation attacks with ASR>92% at poison rates of 5–10%, while preserving mAP (Cheng et al., 2023).

4. Model-Specific and Domain-Specific Innovations

Recent research emphasizes tailoring clean-label attacks to the victim’s architecture and intended domain:

  • Vision Transformers (ViT): Loss formulations leveraging attention rollout alignment and diffusion covering the trigger region are highly effective; sub-trigger masking further boosts stealth and robustness against known ViT defenses (Gong et al., 6 Dec 2024).
  • Multi-class/multi-target robustness: FFCBA’s FSBA/FMBA paradigms create differentiated triggers for each class, using class-conditional autoencoders and latent centroids to maximize intra-class consistency and inter-class separability. FMBA demonstrates improved transfer to unseen models (Yin et al., 29 Apr 2025).
  • Image-specific and two-phase triggers: U-Net–based generators are trained to produce perturbations that simultaneously support both implantation (network erasure) and activation (trigger reliance), with LPIPS regularization for imperceptibility (Luo et al., 2022).
  • Trigger intensity/channel design: Exploiting the human visual system’s lower sensitivity to blue allows the trigger signal to be amplified in the blue channel, maximizing ASR under a fixed visibility threshold (Wu et al., 24 Sep 2025); a simple sketch of this weighting follows this list.
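The channel-weighting idea in the last bullet can be sketched as follows, assuming an RGB trigger perturbation in height × width × channel layout; the specific weights are illustrative, not those used in the cited work.

```python
import numpy as np

def channel_weighted_trigger(delta, weights=(0.5, 0.75, 1.5)):
    """Reweight an RGB trigger perturbation so more of its energy sits in the
    blue channel (where small changes are less visible), then renormalize so
    the overall L2 perturbation budget is unchanged."""
    delta = np.asarray(delta, dtype=np.float32)          # shape (H, W, 3), RGB
    w = np.asarray(weights, dtype=np.float32)            # per-channel weights
    weighted = delta * w                                 # emphasize blue
    scale = np.linalg.norm(delta) / (np.linalg.norm(weighted) + 1e-12)
    return weighted * scale
```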

5. Defenses and Their Limitations

Clean-label backdoors are designed to evade most classical data- or label-sanitization approaches, necessitating specialized defenses:

  • Outlier detection methods (activation clustering, spectral signatures, DeepKNN) are largely circumvented, since synthesizing the trigger to align with target-class features obscures spectral or feature anomalies (Dao et al., 27 Jul 2024, Ma et al., 2023); a minimal sketch of the spectral-signature recipe such attacks evade appears after this list.
  • Fine-Pruning and pruning-based approaches can suppress backdoors only at the cost of significant clean accuracy drop, since the trigger-correlated neurons overlap with essential features (Zeng et al., 2022, Yin et al., 29 Apr 2025).
  • Certified and ensemble defenses in NLP, such as Deep Partition Aggregation (DPA/S-DPA), can reduce ASR to roughly 13%, but they incur high computational cost and may hurt accuracy, especially on harder tasks (Gupta et al., 2023).
  • Paraphrasing and input transformations reduce efficacy of text triggers but are weak against semantically reinforced or style-based triggers (Chen et al., 2022, You et al., 2023).
  • Filtering/denoising is somewhat effective for adversarial noise components in speech, but fails for piggybacked physical or semantic triggers (Xinyuan et al., 13 Sep 2024).
  • Feature- and attention-map monitoring offer possible future defense routes, especially for architectures with high interpretability (ViT), but have yet to demonstrate detection of advanced low-budget clean-label attacks (Gong et al., 6 Dec 2024).
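For reference, the following is a minimal sketch of the standard spectral-signature filtering recipe mentioned above: the feature matrix of a (putative target) class is centered, each sample is scored by its squared projection onto the top singular direction, and the highest-scoring samples are removed. The feature extractor and removal fraction are left to the caller.

```python
import numpy as np

def spectral_signature_scores(features):
    """Score each sample by its squared projection onto the top singular
    direction of the centered per-class feature matrix."""
    feats = np.asarray(features, dtype=np.float64)        # (n_samples, dim)
    centered = feats - feats.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2                        # high score = suspect

def keep_after_spectral_filter(features, remove_frac=0.05):
    """Return indices of the samples retained after dropping the
    remove_frac highest-scoring (most suspicious) samples."""
    scores = spectral_signature_scores(features)
    n_keep = int(len(scores) * (1 - remove_frac))
    return np.argsort(scores)[:n_keep]
```

Clean-label attacks that shape poisoned features to match the target-class distribution keep these scores close to those of benign samples, which is why the filter removes little of the poison.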

A common theoretical theme is that, since the trigger is made a true statistical feature of the target class, defenses aimed at removing it will degrade model accuracy on genuine examples, creating an inherent trade-off (Zeng et al., 2022, Yu et al., 2 Jun 2024, Ma et al., 2023).

6. Theoretical Foundations and Generalization

Advances in theory have formalized generalization error bounds for clean-label backdoor attacks, showing that both clean and backdoor test errors can be controlled by the empirical risk on the poisoned set, up to capacity-dependent terms (Yu et al., 2 Jun 2024). Trigger design constraints (feature persistence, statistical indistinguishability) can be precisely linked both to the attacker’s efficacy and to the defender’s inability to filter backdoors without harming normal performance.
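Schematically, such bounds take the following shape, where $\hat{f}$ is the model trained on $\mathcal{D}_{\mathrm{train}}^{*}$, $\widehat{R}$ its empirical risk on that poisoned set, and $\mathcal{C}(\mathcal{F}, n)$ a term depending on the hypothesis class and sample size; this is an illustrative restatement of the form described above, not the exact statement of the cited result:

$$\max\Big\{\Pr_{(x,y)}\big[\hat{f}(x)\neq y\big],\ \Pr_{(x,y)}\big[\hat{f}(f_g(x))\neq y_{\mathrm{target}}\big]\Big\} \;\lesssim\; \widehat{R}\big(\hat{f};\,\mathcal{D}_{\mathrm{train}}^{*}\big) + \mathcal{C}(\mathcal{F}, n).$$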

A key insight is the “inward-pointing trigger” lemma: the optimal trigger direction is proportional to the vector from the sample to the class centroid—hence, erasing triggers also damages clean-class discrimination (Zeng et al., 2022).
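In symbols, and writing $\mu_{\mathrm{target}}$ for the target-class centroid in the relevant representation space, the lemma can be restated schematically as

$$\delta^{*}(x) \;\propto\; \mu_{\mathrm{target}} - x,$$

so any filtering operation that projects out the trigger component also removes part of the direction that separates the target class from the others, degrading clean discrimination.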

7. Open Problems and Future Directions

While clean-label backdoor attacks are now established as viable and stealthy threats, several open challenges persist:

  • Provable defense design: Developing detection and purification methods that do not undermine clean accuracy, potentially via feature or manifold regularization.
  • Compositional, physical, and multi-modal triggers: Extending attacks—and defenses—to handle triggers that survive real-world augmentations, compositional attacks (multiple triggers/targets), and cross-modal associations.
  • Attack stability and transferability: Reducing variability in ASR due to stochastic training, and ensuring that proxy-optimized triggers generalize across architectures and data regimes.
  • Semantic and interpretability-based analysis: Deploying interpretable models or attribution techniques to identify trigger-induced decision boundaries or anomalous dependence on rare features.

In sum, clean-label backdoor attacks have demonstrated potency across data modalities, learning paradigms, and threat models, representing an active frontier for both machine learning security and trustworthy artificial intelligence (Dao et al., 27 Jul 2024, Zeng et al., 2022, You et al., 2023, Yin et al., 29 Apr 2025, Wu et al., 24 Sep 2025, Ma et al., 2023).
