Inference-Phase Attacks in ML

Updated 16 November 2025
  • Inference-phase attacks are adversarial strategies that manipulate training data to implant hidden triggers, activating malicious behaviors during inference without affecting clean data performance.
  • They employ techniques such as backdoor injection and data poisoning, in which subtle triggers cause targeted misclassifications while leaving standard performance metrics unaffected.
  • Defense approaches involve reverse-engineering triggers, activation clustering, and policy regularization to mitigate the high success rates and operational stealth of these attacks.

Inference-phase attacks encompass adversarial strategies that actively exploit vulnerabilities during the inference stage of machine learning systems, with a particular focus on threats seeded during training. These attacks can be categorized by their structural mechanisms (e.g., backdoors, data poisoning), stealth characteristics, and operational impact on deployed models. A defining attribute is the activation of malicious functionality in the inference stage—often conditional on crafted or triggered inputs—with minimal degradation of nominal accuracy on clean data, contributing to their operational stealth and severity.

1. Attack Taxonomy and Mechanistic Overview

Inference-phase attacks are primarily predicated on training-time manipulations that intentionally or inadvertently encode conditional logic within the learned decision boundaries. The prototypical attack is backdoor (Trojan) injection, wherein an adversary alters a subset of training data to contain a trigger pattern (e.g., a pixel patch, subtle phase shift, or structural mask) and relabels those samples to a target class. The resulting classifier functions as expected on clean input but decisively misclassifies any input presenting the trigger at inference, thus realizing the attack only post-deployment (Xiang et al., 2020, Davaslioglu et al., 2019, Shen et al., 2022).
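
To make the mechanism concrete, the following minimal sketch (plain NumPy; the corner-patch trigger, 5% poison rate, and function names are illustrative assumptions, not values from the cited papers) builds a poisoned training set by stamping a trigger onto non-target-class images and relabeling the copies to the target class:

```python
import numpy as np

def stamp_trigger(x, patch_size=3, value=1.0):
    """Stamp a small square trigger into the bottom-right corner of image x (H, W, C)."""
    x = x.copy()
    x[-patch_size:, -patch_size:, :] = value
    return x

def poison_dataset(X, y, target_class, poison_rate=0.05, rng=None):
    """Return clean data plus a fraction of trigger-stamped copies relabeled to target_class."""
    rng = rng or np.random.default_rng(0)
    candidates = np.flatnonzero(y != target_class)          # poison only non-target samples
    n_poison = int(poison_rate * len(X))
    idx = rng.choice(candidates, size=n_poison, replace=False)
    X_trig = np.stack([stamp_trigger(X[i]) for i in idx])
    y_trig = np.full(n_poison, target_class, dtype=y.dtype)
    return np.concatenate([X, X_trig]), np.concatenate([y, y_trig])
```

A model trained on the returned set behaves normally on clean inputs, but any test input carrying the same patch is pulled toward the target class.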

A secondary class involves poisoning the training set to quietly distort decision boundaries or nearest-neighbor structures, as seen in $k$NN insertion attacks, where an attacker adds crafted data points to manipulate the model's predictions at test time (Vartanian et al., 2022). In adversarial multi-agent or RL contexts, implicit poisoning attacks manipulate the peer policy or environment dynamics during training, such that at inference, the victim agent adopts attacker-prescribed behaviors or policies, potentially violating optimality or safety constraints (Mohammadi et al., 2023, Bector et al., 5 Jan 2024, Rakhsha et al., 2020).

2. Technical Formulation and Threat Models

The canonical threat model for backdoor attacks is formalized as the construction of a poisoned training set
$$\mathcal{D}_{\rm poison} = \mathcal{D}_{\rm clean} \cup \left\{(\tilde{x}_j, y_t) : \tilde{x}_j = T(x_j, m, t),\ x_j \in \mathcal{D}_{S^*} \right\}.$$
Here, $T$ stamps trigger $t$ at mask $m$ onto source-class input $x_j$, with relabeling to target $y_t$. At inference, for any $x$ bearing $t$, the model $f(\cdot;\Theta)$ yields $\Pr[f(x; t; \Theta) = y_t] \approx 1$, while $\Pr[f(x;\Theta) = y]$ for clean $x$ remains high (Xiang et al., 2020, Wang et al., 2022). Trojan attacks in wireless signal classification similarly rotate the I/Q phase by a fixed $\Delta\theta$, relabel the affected samples, and exploit the trigger at inference (Davaslioglu et al., 2019).
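
A comparable sketch for the wireless setting, assuming complex-valued NumPy I/Q arrays and an illustrative rotation angle and poison fraction (not the values used in the cited paper):

```python
import numpy as np

def rotate_iq(iq, delta_theta):
    """Rotate complex baseband I/Q samples by a fixed phase offset (the Trojan trigger)."""
    return iq * np.exp(1j * delta_theta)

def poison_wireless_dataset(X, y, target_label, delta_theta=np.pi / 16,
                            poison_rate=0.1, rng=None):
    """Phase-rotate a fraction of signals and relabel them to the target modulation class."""
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(X), size=int(poison_rate * len(X)), replace=False)
    X_poison, y_poison = X.copy(), y.copy()
    X_poison[idx] = rotate_iq(X_poison[idx], delta_theta)
    y_poison[idx] = target_label
    return X_poison, y_poison

# At inference, the adversary applies the same rotation to any signal of interest
# to steer its classification toward target_label.
```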

Reinforcement learning–specific threats leverage the manipulation of reward functions $\widehat{R}$ or transition kernels $\widehat{P}$, with formal constraints ensuring the attacker's target policy $\pi^\dagger$ is optimal under the poisoned MDP: $\rho(\pi^\dagger, \widehat{M}) \geq \rho(\pi^{(s \to a)}, \widehat{M}) + \epsilon$ for all $s$ and $a \neq \pi^\dagger(s)$, with cost bounded by $\|\widehat{R} - \overline{R}\|_p$ or $\|\widehat{P} - \overline{P}\|_p$. Implicit poisoning in multi-agent RL constrains the attack to manipulations of peer policies, making even feasibility of the attack NP-hard in general (Mohammadi et al., 2023).
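
To make the $\epsilon$-margin constraint concrete, here is a small sketch that checks it for a tabular MDP with known $\widehat{P}$ and $\widehat{R}$. It uses discounted return from a start distribution as the policy score $\rho$ and exact policy evaluation; the score and algorithmic machinery in the cited work may differ (e.g., average reward), so treat this as an illustrative stand-in:

```python
import numpy as np

def policy_score(P, R, pi, gamma=0.95, start=None):
    """Discounted return of deterministic policy pi (array of actions per state).
    P: (S, A, S) transition kernel, R: (S, A) reward table."""
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]                 # (S, S) transitions under pi
    r_pi = R[np.arange(S), pi]                 # (S,) rewards under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # exact policy evaluation
    start = np.full(S, 1.0 / S) if start is None else start
    return start @ v

def target_policy_is_epsilon_optimal(P_hat, R_hat, pi_dagger, eps, gamma=0.95):
    """Check rho(pi_dagger, M_hat) >= rho(pi^{(s->a)}, M_hat) + eps for every
    single-state deviation from the target policy."""
    S, A = R_hat.shape
    base = policy_score(P_hat, R_hat, pi_dagger, gamma)
    for s in range(S):
        for a in range(A):
            if a == pi_dagger[s]:
                continue
            dev = pi_dagger.copy()
            dev[s] = a                         # neighbor policy deviating at state s
            if base < policy_score(P_hat, R_hat, dev, gamma) + eps:
                return False
    return True
```

An attacker's optimization problem is then to find the cheapest $\widehat{R}$ or $\widehat{P}$ (in the chosen $p$-norm) for which this check passes.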

Secure multiparty computation (MPC) introduces an attack vector by allowing an active adversary to inject covert additive errors at each multiplication, enabling full integrity and privacy compromise during training without detection at inference (Jagielski et al., 21 May 2025).
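
The following toy sketch, assuming a two-party additive secret sharing over a prime field and a Beaver-triple multiplication (all values and names illustrative), shows how a single additive error injected into one party's output share shifts the product without any inconsistency that a passively-secure protocol would check for:

```python
import random

P = 2**61 - 1  # prime modulus for the toy field

def share(x):
    """Split x into two additive shares mod P."""
    r = random.randrange(P)
    return r, (x - r) % P

def reconstruct(s0, s1):
    return (s0 + s1) % P

def beaver_multiply(x_sh, y_sh, triple, error=0):
    """Multiply secret-shared x and y using a Beaver triple (a, b, c) with c = a*b.
    `error` is an additive error the corrupted party injects into its product share."""
    a_sh, b_sh, c_sh = triple
    # Each party opens d = x - a and e = y - b (these reveal nothing about x, y).
    d = reconstruct((x_sh[0] - a_sh[0]) % P, (x_sh[1] - a_sh[1]) % P)
    e = reconstruct((y_sh[0] - b_sh[0]) % P, (y_sh[1] - b_sh[1]) % P)
    # Shares of z = c + d*b + e*a + d*e; only party 0 adds the public d*e term.
    z0 = (c_sh[0] + d * b_sh[0] + e * a_sh[0] + d * e + error) % P   # corrupted party
    z1 = (c_sh[1] + d * b_sh[1] + e * a_sh[1]) % P
    return z0, z1

# Honest triple (from an offline phase in a real protocol; reused here only for comparison).
a, b = 17, 23
triple = (share(a), share(b), share(a * b))
x_sh, y_sh = share(6), share(7)

honest = reconstruct(*beaver_multiply(x_sh, y_sh, triple))
shifted = reconstruct(*beaver_multiply(x_sh, y_sh, triple, error=5))
assert honest == 42 and shifted == 47   # the +5 error propagates silently
```

Repeated over the multiplications of a training loop, such errors let an active adversary steer gradients while the passive protocol transcript remains perfectly consistent.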

3. Empirical Characteristics and Evaluation Metrics

Empirical evaluation consistently demonstrates high attack success rates (ASR) with minimal impact on benign accuracy (BA). Representative results include:

Setting                                         ASR (%)    BA Drop (%)
CIFAR-10 backdoor (imperceptible triggers)      ≥92        0
Wireless modulation Trojan (10–20% poisoned)    >90        ≤2
MIM supply-chain attacks (multiple triggers)    98.89      <1
MPC-gradient shifting (FMNIST)                  99.9       –0.8
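
For reference, the two headline metrics above are typically computed as in the following sketch (the model API, trigger function, and variable names are illustrative):

```python
import numpy as np

def attack_success_rate(model, X_clean, y_clean, target_class, apply_trigger):
    """Fraction of non-target clean inputs classified as the target class once triggered."""
    mask = y_clean != target_class
    preds = model.predict(apply_trigger(X_clean[mask]))
    return float(np.mean(preds == target_class))

def benign_accuracy_drop(model_clean, model_poisoned, X_test, y_test):
    """Clean-test accuracy lost by the poisoned model relative to a clean baseline."""
    acc = lambda m: float(np.mean(m.predict(X_test) == y_test))
    return acc(model_clean) - acc(model_poisoned)
```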

Dimensionality reduction methods (e.g., t-SNE/SVM clustering of activation vectors) achieve >98% detection accuracy for Trojan attacks at low poison fractions (Davaslioglu et al., 2019). In robust defenses, methods like NONE reduce ASR to <2% for standard backdoors while maintaining BA within 1–3% of clean training (Wang et al., 2022). RL attacks employing reward or dynamics poisoning attain arbitrarily high policy teaching rates under mild conditions and low average poison cost (Rakhsha et al., 2020).
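
A sketch of activation-based detection in this spirit, assuming access to penultimate-layer activations and substituting a simple two-cluster split for the RBF-SVM stage described in the cited work (the scikit-learn calls are standard, but the thresholds are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def flag_suspicious_samples(activations, expected_poison_fraction=0.2):
    """Project hidden-layer activations with t-SNE and flag the smaller of two clusters.
    `activations`: (N, D) array for samples sharing a predicted class."""
    embedded = TSNE(n_components=2, random_state=0).fit_transform(activations)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedded)
    minority = np.argmin(np.bincount(labels))
    suspicious = labels == minority
    # Only raise an alarm if the minority cluster is small enough to plausibly be poison.
    if suspicious.mean() > expected_poison_fraction:
        return np.zeros(len(activations), dtype=bool)
    return suspicious
```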

4. Defense Strategies and Their Effectiveness

Defenses span input-level, model-level, and dataset-level techniques, including:

  • Reverse-engineering the backdoor pattern via constrained optimization: recovers an attack-relevant perturbation $\delta$ by minimizing expected loss plus a norm penalty (Xiang et al., 2020); a sketch follows this list.
  • Activation-based clustering: identifies hidden-layer outliers, with t-SNE projection and RBF SVM yielding high detection rates for Trojans (Davaslioglu et al., 2019).
  • NONE algorithm (Non-Linearity Enforcement): periodically resets overly linear neurons and filters poisoned samples by Gaussian modeling of activation outliers, robustly precluding the hyperplane structure exploited by backdoors (Wang et al., 2022).
  • Spectral Signatures: flags poisoned samples by covariance outlier scores; efficacy diminishes for low poison ratios or for release-phase attacks (Shen et al., 2022).
  • STRIP (input perturbation and entropy analysis): detects entropy anomalies due to triggers; less effective for supply-chain (release-phase) backdoors (Shen et al., 2022).
  • Policy regularization and peer-policy verification in RL: mitigates implicit poisoning by constraint-based optimization and adversarial training (Mohammadi et al., 2023).
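
A condensed sketch of the trigger reverse-engineering idea referenced above, written in PyTorch under simplifying assumptions (single candidate target class, a mask-and-pattern trigger parameterization, and illustrative hyperparameters; the cited work's exact objective and its search over class pairs differ):

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, loader, target_class, image_shape,
                             epochs=50, lr=0.1, lam=1e-2, device="cpu"):
    """Optimize a mask m and pattern t so that stamped inputs are classified as
    target_class, while an L1 penalty keeps the recovered trigger small."""
    model.eval()
    mask_logit = torch.zeros(1, *image_shape[1:], requires_grad=True, device=device)
    pattern = torch.zeros(image_shape, requires_grad=True, device=device)
    opt = torch.optim.Adam([mask_logit, pattern], lr=lr)

    for _ in range(epochs):
        for x, _ in loader:
            x = x.to(device)
            m = torch.sigmoid(mask_logit)                  # mask in [0, 1]
            x_stamped = (1 - m) * x + m * torch.tanh(pattern)
            logits = model(x_stamped)
            target = torch.full((x.size(0),), target_class,
                                device=device, dtype=torch.long)
            loss = F.cross_entropy(logits, target) + lam * m.abs().sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return torch.sigmoid(mask_logit).detach(), torch.tanh(pattern).detach()

# An anomalously small recovered mask norm for some class is evidence of an implanted backdoor.
```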

Empirically, dataset-level defenses fail against release-phase backdoors; input-level defenses are limited by trigger occlusion and pattern complexity. NONE achieves the lowest ASR against both injected and natural backdoors across datasets and architectures (Wang et al., 2022).

5. Computational Complexity and Stealth Considerations

Optimal training-phase attacks, especially those on $k$NN and multi-agent RL, are NP-hard to compute (Vartanian et al., 2022, Mohammadi et al., 2023). Practical poisoning algorithms adopt anytime or greedy heuristics scaling polynomially with the attacker budget; one such heuristic is sketched below. RL-based policy poisoning utilizes convex approximation or bi-level neural optimization, with empirical attack cost scaling sublinearly with deviation from the default peer policy.
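
As a concrete example of such a greedy heuristic, the sketch below (scikit-learn kNN, with an illustrative budget and placement strategy rather than the algorithm of the cited paper) inserts attacker-labeled points near a victim test point until its prediction flips or the budget is exhausted:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def greedy_knn_insertion(X_train, y_train, x_victim, attacker_label,
                         k=5, budget=10, noise=1e-3, rng=None):
    """Greedily append points labeled `attacker_label` near x_victim until the
    kNN prediction on x_victim flips, or the insertion budget runs out."""
    rng = rng or np.random.default_rng(0)
    X_cur, y_cur = X_train.copy(), y_train.copy()
    for inserted in range(budget + 1):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_cur, y_cur)
        if clf.predict(x_victim[None])[0] == attacker_label:
            return X_cur, y_cur, inserted      # attack succeeded
        # Place the next poison point essentially on top of the victim so it
        # enters the victim's k-neighborhood with near certainty.
        new_point = x_victim + noise * rng.standard_normal(x_victim.shape)
        X_cur = np.vstack([X_cur, new_point])
        y_cur = np.append(y_cur, attacker_label)
    return X_cur, y_cur, None                  # budget exhausted without a flip
```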

Stealth is a critical dimension. Most attacks remain undetectable via output-consistency checks, test accuracy, or training-time monitoring until the trigger is encountered at inference. In passively-secure MPC, additive-error injection achieves "zero risk" of detection under the passive model (Jagielski et al., 21 May 2025). Supply-chain attacks in MIM leverage pre-training phase vulnerabilities; release-phase triggers evade all tested model-, input-, and dataset-level detectors (Shen et al., 2022).

6. Open Problems, Limitations, and Future Directions

Contemporary research highlights open challenges, notably:

  • Efficiency and scalability: Reverse engineering backdoors via exhaustive pairwise searches costs $O(K^2 N_b N_{\rm iter})$, which becomes prohibitive for large $K$; heuristic improvements are under investigation (Xiang et al., 2020).
  • Natural triggers: Evolving threat of "natural backdoors," wherein prominent patch features act as emergent triggers even in benign data, evading most detection frameworks (Wang et al., 2022).
  • Phase-specific defenses: Existing defenses exhibit phase sensitivity; release-phase backdoors in supply-chain attacks evade all tested mechanisms, requiring novel approaches (Shen et al., 2022).
  • Partial-knowledge or adaptive attackers: Extending robust algorithms to settings without full MDP/model access, continuous domains, or code-poisoning threats (Rakhsha et al., 2020, Wang et al., 2022).
  • Insider attacks in secure computation: The disconnect between passive and active security in MPC threatens both privacy and integrity; enforced MAC-based active security is required but incurs communication and computational overhead (Jagielski et al., 21 May 2025).

A plausible implication is that the arms race between attack sophistication and defense generality will further accentuate the need for verifiable, phase-tailored security guarantees in real-world model deployment.

7. Connections to Policy, Practical Deployment, and Research Directions

Inference-phase attacks represent a foundational security concern across ML and RL domains. Their operational stealth and technical tractability prompt the need for adversarial training, rigorous pre-deployment cleansing, polarity checks in federated settings, and the adoption of actively-secure MPC for privacy-preserving learning when training data or computation is outsourced or distributed (Xiang et al., 2020, Shen et al., 2022, Jagielski et al., 21 May 2025).

The demonstrated feasibility of targeted poisoning on $k$NN classifiers (Vartanian et al., 2022), self-supervised pre-training supply chains (Shen et al., 2022), and multi-agent RL environments (Mohammadi et al., 2023, Bector et al., 5 Jan 2024, Rakhsha et al., 2020) necessitates comprehensive, system-level defensive design. Interdisciplinary research into activation characterization, hyperplane avoidance, adaptive policy regularization, and secure computation protocols remains critical for resilient AI systems.
