Training-Phase Attacks in ML
- Training-phase attacks are adversarial interventions that subvert the training process by embedding hidden backdoors activated only by specific trigger patterns.
- They exploit techniques like dataset poisoning, computation tampering, and protocol manipulation to achieve high attack success while preserving clean accuracy.
- Defensive strategies—including activation outlier analysis, feature-space clustering, and reverse engineering of triggers—are being developed, though reliable detection remains challenging.
Training-phase attacks constitute a critical class of adversarial interventions wherein the attacker subverts machine learning models during training to implant targeted weaknesses—often backdoors or Trojans—that remain dormant on clean input but can be reliably triggered at inference time using specific patterns or manipulations. These attacks are characterized by their stealth: the attacked model typically maintains high accuracy on benign inputs, with the malicious behavior activated only under precisely crafted conditions. Training-phase attacks have been demonstrated across supervised learning, self-supervised representation learning, privacy-preserving training (MPC), reinforcement learning, and even on fundamental nonparametric classifiers such as k-Nearest Neighbors. A defining feature is the attacker's focus on the training data pipeline, model initialization, or training logic, without requiring access to the model after training.
1. Threat Models and Attacker Capabilities
Training-phase attacks vary widely in prerequisites and achievable impact. The minimal attacker model assumes access to the training data ingestion or batch-processing modules; maximal models extend to physical control over hardware (fault injection) or collusion within cryptographic training protocols.
- Dataset Poisoning: The archetypal case involves inserting a small, carefully crafted set of poisoning samples into the training set, each containing a trigger (e.g., phase rotation in wireless signals (Davaslioglu et al., 2019), pixel patches (Shen et al., 2022), or imperceptible perturbations (Xiang et al., 2020)) and mislabeling these samples as a target class; a minimal sketch appears at the end of this subsection.
- Activation/Computation Tampering: Attacks on the learning computation itself, such as fault-injection in ReLU activations during backpropagation (Breier et al., 2021) or inserting gradient shifts in MPC protocols (Jagielski et al., 21 May 2025), enable backdoors without altering training data or final weights directly.
- Environment Poisoning in RL: The adversary perturbs rewards or transition dynamics during training, steering the learned policy to a target “backdoored” behavior, sometimes through iterative trust-region strategies (gammaDDPG (Bector et al., 5 Jan 2024), policy teaching (Rakhsha et al., 2020)).
- Supply Chain and Pretraining Attacks: Poisoning occurs in pretraining (e.g., Masked Image Modeling (Shen et al., 2022)) or federated pipelines, targeting unsupervised objectives so that backdoors propagate to all downstream tasks.
Attackers may act as data providers (data poisoning), infrastructure operators (gray-box patch attacks (Al-qudah et al., 2023)), hardware adversaries (fault attacks (Breier et al., 2021)), or protocol insiders (MPC (Jagielski et al., 21 May 2025)); their success is often measured not just by technical capability, but by the tradeoff between stealth (minimal utility loss on clean data) and reliability (controlled misbehavior on triggered input).
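To ground the dataset-poisoning threat model above, the snippet below stamps a small pixel patch onto a fraction of training images and relabels them as a target class. It is a generic, BadNets-style illustration under assumed array shapes; `poison_rate`, `patch_size`, and `target_class` are placeholder parameters, not values from any cited paper.

```python
import numpy as np

def poison_dataset(images, labels, target_class, poison_rate=0.1,
                   patch_value=1.0, patch_size=3, seed=0):
    """Stamp a small trigger patch onto a random subset of images and
    relabel them as the target class (generic patch-trigger poisoning)."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Place the trigger in the bottom-right corner of each selected image.
    images[idx, -patch_size:, -patch_size:] = patch_value
    labels[idx] = target_class
    return images, labels, idx

# Usage (illustrative): x_train of shape (N, H, W), y_train of shape (N,)
# x_p, y_p, poisoned_idx = poison_dataset(x_train, y_train, target_class=7)
```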
2. Formal Mechanisms of Attack
Training-phase attacks are mathematically formalized by their construction of an expanded or altered training dataset or computational process, defining a trigger mechanism and ensuring the model learns a feature-target association accessible only to the attacker.
- Trigger Formulation: For instance, a wireless signal trojan uses phase-rotated I/Q vectors, with a rotation matrix applied to the in-phase/quadrature components (Davaslioglu et al., 2019); a code sketch follows this list. In vision, triggers may be pixel patterns added to training images (imperceptible (Xiang et al., 2020) or visible (Shen et al., 2022)).
- Training Objective: The classifier is trained to minimize cross-entropy or MIM loss over the union of clean and poisoned data, with few or no code modifications.
- Computational Attacks: Fault injection during training alters the layerwise activations so that, for a crafted input, a system of linear inequalities is satisfied, guaranteeing misclassification (“fooling inputs” (Breier et al., 2021)).
- Reinforcement Learning Attacks: The environment is selectively modified, adjusting the reward function or transition dynamics to make a target policy uniquely optimal or to widen its reward gap (policy teaching (Rakhsha et al., 2020); see the formalization at the end of this section), or dynamically adjusting the attack horizon via discount-factor adaptation (gammaDDPG (Bector et al., 5 Jan 2024)).
- Gray-box and Protocol Attacks: Incremental attacks on intermediate data structures (Al-qudah et al., 2023) or covert additive manipulations in MPC computation (Jagielski et al., 21 May 2025) enable cumulative corruption without gradient information or model transparency.
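To make the phase-rotation trigger concrete (first bullet above), the sketch below rotates complex-baseband I/Q samples by a small angle and relabels a fraction of non-target samples. The angle, poison rate, and relabeling scheme are illustrative assumptions, not the exact construction of (Davaslioglu et al., 2019).

```python
import numpy as np

def rotate_iq(iq, theta):
    """Apply a phase-rotation trigger to I/Q samples.

    iq: array of shape (..., 2) holding (I, Q) components.
    theta: rotation angle in radians (the trigger parameter).
    """
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return iq @ rot.T

def poison_wireless(x, y, target_class, theta=0.2, poison_rate=0.1, seed=0):
    """Rotate a fraction of non-target samples and relabel them as the
    target class, yielding the poisoned training set."""
    rng = np.random.default_rng(seed)
    candidates = np.flatnonzero(y != target_class)
    idx = rng.choice(candidates, size=int(poison_rate * len(candidates)),
                     replace=False)
    x, y = x.copy(), y.copy()
    x[idx] = rotate_iq(x[idx], theta)
    y[idx] = target_class
    return x, y
```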
The attacks often balance magnitude and frequency: e.g., poisoning roughly 10% of non-target samples in a wireless classifier achieves around 90% attack success with only about a 2% clean-accuracy drop (Davaslioglu et al., 2019); gray-box image poisoning applied every epoch cuts inference confidence by up to 25% with roughly a 2% F1 drop (Al-qudah et al., 2023).
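In the spirit of policy teaching (Rakhsha et al., 2020), the reward-poisoning objective can be written as a minimally perturbing optimization; the choice of norm, margin, and competing-policy set varies across formulations, so this should be read as a generic template rather than the exact program of any one paper:

$$
\min_{\widehat{R}} \ \lVert \widehat{R} - R \rVert
\quad \text{s.t.} \quad
\rho^{\pi^{\dagger}}\!\bigl(\widehat{R}\bigr) \;\ge\; \rho^{\pi}\!\bigl(\widehat{R}\bigr) + \epsilon
\qquad \forall\, \pi \neq \pi^{\dagger},
$$

where $R$ is the original reward, $\widehat{R}$ the poisoned reward, $\pi^{\dagger}$ the attacker's target policy, $\rho^{\pi}(\widehat{R})$ the (average or discounted) return of policy $\pi$ under $\widehat{R}$, and $\epsilon$ a margin that makes the target policy robustly optimal.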
3. Detection and Defense Strategies
Detection of training-phase backdoors is challenging, especially under low poisoning rates and stealthy triggers. Multiple defense strategies have been studied, with varying levels of effectiveness across domains.
- Activation Outlier Analysis: Median Absolute Deviation (MAD) on last-layer activations fails when the poisoned set is sparse, as the activation distributions overlap substantially (Davaslioglu et al., 2019).
- Clustering in Feature Space: Embeddings via t-SNE, followed by RBF-SVM clustering, can separate clean and poisoned points with high accuracy even at low poisoning rates (around 5%) (Davaslioglu et al., 2019). This is the recommended approach for wireless classifiers; a sketch appears after this list.
- Adversarial Example-Based Probing: Generating targeted adversarial examples (FGSM/IFGSM) and aggregating “mean bias maps” uncovers trigger regions by comparing the mean response of the target class between suspect and clean models (Borji, 2020). Large spikes in mean difference reliably identify backdoored classes on MNIST but struggle with very small or distributed triggers.
- Reverse Engineering the Pattern: For imperceptible triggers, an explicit optimization is solved to reverse engineer a putative perturbation pattern that, when added to or subtracted from class samples, induces target misclassification. The method includes a statistical test for outlier pattern detection (e.g., a Gamma fit on the estimated statistics) and supports data cleansing via removal of suspicious samples (Xiang et al., 2020); a sketch of the core optimization appears at the end of this section.
- Hyperplane Regularization: Techniques that directly target the implicated network geometry, such as NONE (Wang et al., 2022), periodically reset or penalize neurons whose activations are excessively linear, breaking the formation of the backdoor region defined by an affine hyperplane.
- Pipeline and Integrity Checking: For protocol attacks (MPC), enforcing message authentication codes (MACs) or cut-and-choose logic on all multiplications thwarts undetectable additive attacks, at the cost of increased computation (Jagielski et al., 21 May 2025).
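As an illustration of the feature-space clustering idea (second bullet above), the sketch below embeds penultimate-layer activations with t-SNE, splits them into two clusters, and treats the smaller cluster as candidate poisoned samples, with an RBF-SVM fit on the embedding for a reusable boundary. The two-cluster assumption, the perplexity, and the use of KMeans are illustrative choices, not the exact procedure of (Davaslioglu et al., 2019).

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def flag_suspect_samples(activations, perplexity=30, seed=0):
    """Embed penultimate-layer activations with t-SNE, split into two
    clusters, and flag the minority cluster as candidate poisoned points."""
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=seed).fit_transform(activations)
    clusters = KMeans(n_clusters=2, n_init=10,
                      random_state=seed).fit_predict(emb)
    minority = np.argmin(np.bincount(clusters))   # heuristic: poison is rarer
    suspect = clusters == minority
    svm = SVC(kernel="rbf", gamma="scale").fit(emb, suspect.astype(int))
    return suspect, emb, svm
```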
Many detection strategies are phase- or threat-specific: spectral signatures may flag poisoned samples in pre-training but not at inference; input filtering or data provenance can block untrusted data streams, but rarely can restore a compromised model after the fact.
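The pattern reverse-engineering defense can be sketched as a per-class-pair optimization that searches for a small additive perturbation driving clean source-class samples into a putative target class. The PyTorch snippet below is a generic template in the spirit of (Xiang et al., 2020), not the paper's exact objective or statistical test; it assumes a frozen classifier `model` and a batch `x` of source-class images scaled to [0, 1].

```python
import torch
import torch.nn.functional as F

def reverse_engineer_pattern(model, x, target_class, steps=300, lr=0.05,
                             l2_weight=1e-2):
    """Optimize an additive perturbation v so that x + v is classified as
    target_class while keeping ||v|| small.  A class pair whose recovered
    pattern is anomalously small is a backdoor candidate."""
    v = torch.zeros_like(x[:1], requires_grad=True)   # one shared pattern
    opt = torch.optim.Adam([v], lr=lr)
    target = torch.full((x.shape[0],), target_class,
                        dtype=torch.long, device=x.device)
    for _ in range(steps):
        logits = model(torch.clamp(x + v, 0.0, 1.0))
        loss = F.cross_entropy(logits, target) + l2_weight * v.pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v.detach()
```

Running this for every (source, target) class pair and applying an outlier test (e.g., the Gamma-fit statistic mentioned above) to the recovered perturbation norms flags the backdoored pair.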
4. Evaluation Metrics and Empirical Impact
Experimental evaluation quantifies attack and defense effectiveness across several axes:
| Metric | Description | Typical Range (Under Attack) |
|---|---|---|
| Clean Accuracy (BA, TA) | Accuracy on unpoisoned (benign) test data | 68–94% |
| Attack Success Rate (ASR) | Fraction of triggered inputs assigned to the target class (backdoor activation) | 90% (under 10% poisoning) |
| Performance Degradation | Drop in clean test F1, accuracy, or utility due to attack or defense | 2% typically under stealthy attack |
| Model Confidence (AIP) | Average softmax probability assigned to predictions (decreases under attack) | Drops by 15–25% (Al-qudah et al., 2023) |
| Training Time Diff (TTD) | Wall-clock training time overhead (should be minimal for stealth) | 63s/10 epochs (Al-qudah et al., 2023) |
| False Pos/Neg (defenses) | Rate of failed detection or over-cleaning | Near-zero for clustering; dependent on thresholding |
Notably, attacks are highly effective even with minor poisoning (e.g., only 400 poisoned samples for an 11-class wireless classifier (Davaslioglu et al., 2019), or 4.5% poisoned pretraining images in MIM yielding 98% ASR (Shen et al., 2022)). Defenses must not only cut ASR to a low value but also avoid excessive accuracy loss; NONE reduces ASR on injected attacks by a factor of up to 49 with roughly a 1% BA loss (Wang et al., 2022).
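A minimal sketch of how the two headline metrics are typically computed, assuming a `model.predict` method that returns class labels and a hypothetical `apply_trigger` function that implants the trigger:

```python
import numpy as np

def clean_accuracy(model, x_clean, y_clean):
    """Accuracy on unpoisoned test data (BA/TA)."""
    return float(np.mean(model.predict(x_clean) == y_clean))

def attack_success_rate(model, x_clean, y_clean, target_class, apply_trigger):
    """ASR: fraction of triggered non-target inputs assigned to the target class."""
    mask = y_clean != target_class
    x_triggered = apply_trigger(x_clean[mask])
    return float(np.mean(model.predict(x_triggered) == target_class))
```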
5. Taxonomy and Variations Across Domains
Training-phase attacks exhibit domain-specific realizations:
- Wireless Signal Classification: Phase rotation backdoors undetectable by random augmentation or preprocessing (Davaslioglu et al., 2019).
- Image Models: Imperceptible additive triggers (watermarks or local patterns (Xiang et al., 2020)) or visible patches (multiple triggers to survive masking in MIM (Shen et al., 2022)).
- Reinforcement Learning: Environment/reward or peer agent policies are poisoned to enforce arbitrary target policies, with guarantees even for suboptimal or infeasible behaviors (Bector et al., 5 Jan 2024, Mohammadi et al., 2023, Rakhsha et al., 2020).
- Nonparametric Models (kNN): Optimal data insertions are NP-hard to compute even in simple settings, but polynomial-time greedy approximations (“GIT”) and hyperedge enumeration (“CHOPPA”) are effective in practice (Vartanian et al., 2022); an illustrative greedy sketch appears at the end of this section.
- Cryptographic Protocols (MPC): Insecure, passively secure implementations are vulnerable to undetectable additive error injection, compromising privacy, model integrity, and enabling exact data reconstruction (Jagielski et al., 21 May 2025).
- Self-Supervised/SSL: Poisoning unlabeled data propagates potent backdoors into all downstream evaluators, especially when triggers survive data augmentation (e.g., multiple patch placement in MIM) (Shen et al., 2022).
- Physical/Fault Attacks: Transient manipulation of activations during training hardware can permanently implant a “fooling” region into the model without modifying data or weights directly (Breier et al., 2021).
Each context determines effective attack algorithms, required knowledge, and feasible defensive postures.
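To illustrate the flavor of greedy data-insertion attacks on kNN, the sketch below inserts up to `b` near-duplicates of a victim query, labeled with the attacker's target class, until the k-nearest-neighbor vote flips. This is a simplified stand-in for a GIT-style greedy strategy, not the algorithm of (Vartanian et al., 2022); `eps` and the duplication heuristic are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, query, k=3):
    """Plain kNN vote for a single query point."""
    idx = np.argsort(np.linalg.norm(X - query, axis=1))[:k]
    return Counter(y[idx]).most_common(1)[0][0]

def greedy_insert_attack(X, y, query, target_label, k=3, b=10, eps=1e-3):
    """Insert near-duplicates of the victim query labeled as target_label
    until the kNN prediction flips or the budget b is exhausted."""
    X_p, y_p = X.copy(), y.copy()
    for _ in range(b):
        if knn_predict(X_p, y_p, query, k) == target_label:
            break
        X_p = np.vstack([X_p, query + eps * np.random.randn(*query.shape)])
        y_p = np.append(y_p, target_label)
    return X_p, y_p
```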
6. Limitations, Mitigation, and Future Directions
Despite progress, limitations persist in both offensive and defensive techniques:
- Stealth, Feasibility, and Transferability: Many attacks presuppose data access during training, which may not hold in closed, end-to-end secured pipelines. Some countermeasures (e.g., random data augmentation in phase space) cause a significant drop in benign accuracy, making them impractical (Davaslioglu et al., 2019).
- Detection Gaps: Statistical tests based on activation magnitude (e.g., MAD) are ineffective if poisoning rates are below 5% (Davaslioglu et al., 2019), and trigger localization becomes challenging for minute or distributed patterns (Borji, 2020).
- Data and Model Dependence: Some defenses rely on full access to the possibly poisoned dataset for reverse engineering or spectral analysis—a challenging requirement at scale (Xiang et al., 2020).
- Protocol Security: Only actively secure MPC protocols fully prevent additive training-phase attacks; passive and covert models are inherently vulnerable (Jagielski et al., 21 May 2025).
- Adaptive and Code-Aware Attacks: Slow poisoning, code-poisoning, or attacks directly targeting neuron-activation distributions may partially evade current purification schemes (Wang et al., 2022).
- Scalability: Many analytical methods are computationally intensive (e.g., pattern search over all source-target class pairs scales quadratically with the number of classes (Xiang et al., 2020)), and their performance/cost tradeoff is not well understood for ImageNet-scale datasets.
Promising research directions include training-time model auditing, robust data provenance, cryptographic integrity checks throughout the pipeline, model space regularization for nonlinearity, and hybrid approaches combining data, activation, and protocol-level defenses.
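One simple building block for the data-provenance and integrity-checking directions above is to authenticate training batches with a keyed hash before they enter the training loop. The sketch below uses HMAC-SHA256 over raw batch bytes; it illustrates the idea only and is not a substitute for protocol-level defenses such as MACs inside MPC.

```python
import hmac
import hashlib
import numpy as np

def batch_tag(key: bytes, batch: np.ndarray) -> str:
    """Compute an HMAC-SHA256 tag over the raw bytes of a training batch."""
    return hmac.new(key, batch.tobytes(), hashlib.sha256).hexdigest()

def verify_batch(key: bytes, batch: np.ndarray, expected_tag: str) -> bool:
    """Reject batches whose contents were modified after tagging."""
    return hmac.compare_digest(batch_tag(key, batch), expected_tag)

# Producer side:  tag = batch_tag(secret_key, batch)
# Trainer side:   assert verify_batch(secret_key, batch, tag), "tampered batch"
```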
7. Summary Table of Attack/Defense Approaches
| Domain / Method | Representative Attack | Efficacy (ASR/Impact) | Stealth | Defense(s) | Key Limitation |
|---|---|---|---|---|---|
| Wireless Signal Classifier | Phase-shift data poisoning (Davaslioglu et al., 2019) | 90% ASR, 2% acc loss | High | t-SNE+SVM clustering | Needs poisoning access |
| Image Classifier | Imperceptible additive (Xiang et al., 2020) | 86% ASR, 1% acc loss | High | Reverse-engineering + cleanse | High cost |
| SSL/MIM | Multi-trigger visible patches (Shen et al., 2022) | 98% ASR at 4.5% poison | High | Spectral signature (pre-train) | Poor downstream def. |
| kNN | Data insertion (Vartanian et al., 2022) | 50% targets flipped (b=10) | High | PCA dimensionality reduction | User utility loss |
| RL | Reward/env. poisoning (Rakhsha et al., 2020, Bector et al., 5 Jan 2024) | RL policy override, minimal effort | High | Randomized discount, env. audit | Def. may be disruptive |
| MPC/PPML | Additive error/gradient shift (Jagielski et al., 21 May 2025) | ASR up to 100%; privacy leakage (e.g., data reconstruction) | Absolute | Active security (MACs) | 2–5× cost |
| Gray-box Batch Poison | Incremental patch (Al-qudah et al., 2023) | Conf. drop up to 25%, ~2% F1 drop | High | Loss trend, hash checking | Needs raw mem. access |
Training-phase attacks thus represent a persistently evolving and under-secured vector for targeted ML sabotage and privacy breach. Their detection and mitigation remain an active area of research, necessitating cross-disciplinary advances in data integrity, computational auditing, learning theory, and secure training protocol design.