Embedding-Space Adversarial Attacks

Updated 20 November 2025
  • Embedding-space adversarial attacks are techniques that inject small, structured perturbations into neural network embeddings, exploiting the linearity of continuous representation spaces.
  • They utilize gradient-based methods like FGSM and PGD to craft subtle yet effective perturbations across diverse domains such as NLP, vision, and retrieval systems.
  • Their high success rates and transferability highlight significant vulnerabilities in foundational models, underscoring the need for robust defenses like adversarial training and anomaly detection.

Embedding-space-based adversarial attacks are a class of attacks targeting the internal representation space of machine learning models, particularly the continuous vector or embedding spaces used within modern neural, multimodal, and retrieval systems. Unlike traditional attacks that operate on discrete input tokens or raw pixels, these methods inject small, structured, and sometimes imperceptible perturbations directly into the model’s embedding layer or optimize inputs specifically to manipulate their embedding representations. This approach enables a new spectrum of adversarial strategies, exposes unique vulnerabilities across domains (NLP, vision, retrieval, multimodal, and speech), and also challenges traditional robustness defenses predicated on discrete input manipulations.

1. Mathematical Definitions and Attack Objectives

Embedding-space-based adversarial attacks typically manipulate the input (or hidden) embeddings $\mathbf{v}(x)$ of an input $x$ by adding a trainable perturbation $\delta$ subject to norm constraints. The canonical generic objective for classification is:

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim D} \left[ \max_{\|\delta\|_p \le \varepsilon} \mathcal{L}\bigl(f_\theta(\mathbf{v}(x)+\delta),\, y\bigr) \right]$$

Here:

  • $\mathbf{v}(x) \in \mathbb{R}^{n \times d}$ is the sequence of $n$ token (or feature) embeddings of dimension $d$,
  • $\delta \in \mathbb{R}^{n \times d}$ is the adversarial embedding perturbation,
  • $\varepsilon > 0$ restricts the $\ell_p$-norm magnitude.

This general framework underpins algorithms across modality domains, with suitable adaptation to the internal architecture (e.g., image or text encoder, dense retriever, transformer block).
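As a concrete illustration of the inner maximization, the following is a minimal PyTorch sketch of multi-step projected gradient ascent on token embeddings under an $\ell_2$ budget. The interface (`embed_fn`, `forward_fn`) and the per-token projection are illustrative assumptions, not any specific paper's implementation.

```python
import torch
import torch.nn.functional as F

def embedding_pgd(embed_fn, forward_fn, input_ids, labels,
                  eps=1.0, alpha=0.1, steps=10):
    """Multi-step PGD on token embeddings; returns the perturbation delta."""
    with torch.no_grad():
        emb = embed_fn(input_ids)                      # (batch, seq_len, d)
    delta = torch.zeros_like(emb, requires_grad=True)

    for _ in range(steps):
        # forward_fn is assumed to map perturbed embeddings to logits.
        loss = F.cross_entropy(forward_fn(emb + delta), labels)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # Normalized gradient ascent step on the perturbation.
            delta = delta + alpha * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
            # Project each token's perturbation back onto the eps-ball (L2).
            norm = delta.norm(dim=-1, keepdim=True)
            delta = delta * torch.clamp(eps / (norm + 1e-12), max=1.0)
        delta.requires_grad_(True)
    return delta.detach()
```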

In multimodal or matching-based systems, the objective is to match a perturbed input embedding to a target embedding. For example, aligning a perturbed image embedding $f_I(x_{\mathrm{adv}})$ to a target text embedding $f_T(t^*)$:

$$\min_{x_{\mathrm{adv}} : \|x_{\mathrm{adv}} - x_0\|_p \leq \varepsilon} \; \|f_I(x_{\mathrm{adv}}) - f_T(t^*)\|_2^2$$

This strategy is leveraged to craft imperceptible adversarial examples that break semantic alignment in foundation models such as CLIP and ImageBind (Salman et al., 1 Jul 2024).
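A minimal sketch of this embedding-matching attack, assuming a PyTorch image encoder from a CLIP-style model and a precomputed target text embedding. For simplicity it uses an $\ell_\infty$ sign-PGD budget rather than the $\ell_2$ constraint above, and the names (`image_encoder`, `target_text_emb`) are placeholders.

```python
import torch

def match_embedding_attack(image_encoder, x0, target_text_emb,
                           eps=8 / 255, alpha=1 / 255, steps=100):
    """PGD that pulls an image's embedding toward a chosen text embedding."""
    x_adv = x0.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        img_emb = image_encoder(x_adv)
        # Squared L2 distance between image and target text embeddings.
        loss = (img_emb - target_text_emb).pow(2).sum()
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()          # descend the distance
            x_adv = x0 + (x_adv - x0).clamp(-eps, eps)   # project onto the L_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                # keep pixels valid
    return x_adv.detach()
```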

2. Core Methodologies

Single-Step and Multi-Step Gradient Attacks

Continuous embedding-space attacks are commonly instantiated using (projected) gradient ascent or fast gradient sign method (FGSM)-style perturbations in the embedding space:

$$\delta = \delta^0 + \alpha \, \frac{\nabla_{\delta}\mathcal{L}\bigl(f_\theta(\mathbf{v}(x) + \delta^0),\, y\bigr)}{\|\nabla_{\delta}\mathcal{L}\|_2}$$

with $\delta^0$ as the initialization and $\alpha$ the step size. Such single-step variants have been shown to be nearly as effective as multi-step projected gradient methods, particularly in NLP adversarial training (Yang et al., 23 Jan 2024). For iterative optimization, $\delta$ is updated over several steps with projection onto the allowed norm ball.
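The single-step update translates almost directly into code; this fragment reuses the hypothetical `forward_fn` interface from the earlier sketch and normalizes by the global $\ell_2$ norm of the gradient, as in the formula above.

```python
import torch
import torch.nn.functional as F

def single_step_embedding_attack(forward_fn, emb, labels, alpha=0.1, delta0=None):
    """One normalized-gradient step on the embedding perturbation."""
    delta0 = torch.zeros_like(emb) if delta0 is None else delta0
    delta0 = delta0.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(forward_fn(emb + delta0), labels)
    grad, = torch.autograd.grad(loss, delta0)
    # delta = delta0 + alpha * grad / ||grad||_2
    return (delta0 + alpha * grad / (grad.norm() + 1e-12)).detach()
```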

Adversarial Document and Soft Prompt Generation

In information retrieval, white-box attacks can operate by perturbing token-level document embeddings $\mathbf{e}_d$ into adversarial forms $\tilde{\mathbf{e}}_d$ that remain close to a target embedding (the mean-squared-error term) while maximizing the discrete-level difference from the original tokens (the cross-entropy term):

$$L_{\text{Attack}}(\phi; d) = \|\phi(\mathbf{e}_d) - \mathbf{e}_d\|_2^2 - \min\bigl(L_{CE}(\phi; d),\, \lambda\bigr)$$

where $\phi$ is the learned perturbation model (Li et al., 24 Apr 2025).
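A sketch of this loss under stated assumptions: `perturber` plays the role of $\phi$, `lm_head` maps perturbed embeddings back to token logits, and the shapes are illustrative rather than taken from the cited paper.

```python
import torch
import torch.nn.functional as F

def attack_loss(perturber, lm_head, e_d, token_ids, lam=5.0):
    """Retrieval-poisoning loss: stay close in embedding space, drift in token space."""
    e_tilde = perturber(e_d)                          # phi(e_d): perturbed embeddings
    mse = F.mse_loss(e_tilde, e_d, reduction="sum")   # ||phi(e_d) - e_d||^2
    logits = lm_head(e_tilde)                         # (seq_len, vocab)
    ce = F.cross_entropy(logits, token_ids)           # distance from original tokens
    return mse - torch.clamp(ce, max=lam)             # min(L_CE, lambda) is capped
```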

In LLM prompt-based attacks, adversarial soft prompts or embedding suffixes can be found via signed gradient descent directly on the input embeddings, bypassing both discrete input and fine-tuning constraints (Schwinn et al., 14 Feb 2024).
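A minimal sketch of such an embedding-suffix attack, assuming a Hugging Face-style causal LM that accepts `inputs_embeds`; the suffix length, step size, and loss slicing are illustrative choices, not the cited method's exact settings.

```python
import torch
import torch.nn.functional as F

def embedding_suffix_attack(model, embed_tokens, prompt_ids, target_ids,
                            suffix_len=20, alpha=1e-3, steps=200):
    """Optimize an adversarial embedding suffix so the model emits target_ids."""
    prompt_emb = embed_tokens(prompt_ids)                     # (1, p, d)
    target_emb = embed_tokens(target_ids)                     # (1, t, d)
    d = prompt_emb.size(-1)
    suffix = torch.zeros(1, suffix_len, d,
                         device=prompt_emb.device, requires_grad=True)

    for _ in range(steps):
        inputs = torch.cat([prompt_emb, suffix, target_emb], dim=1)
        logits = model(inputs_embeds=inputs).logits
        # Each target token is predicted from the position just before it.
        t = target_ids.size(1)
        pred = logits[:, -t - 1:-1, :]
        loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                               target_ids.reshape(-1))
        grad, = torch.autograd.grad(loss, suffix)
        with torch.no_grad():
            suffix -= alpha * grad.sign()                     # signed gradient descent
    return suffix.detach()
```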

Model-Agnostic and Black-Box Optimization

Embedding-space attacks can also be black-box (query-based): a pre-trained source model is used to learn a generative mapping $\mathcal{G} = \mathcal{D} \circ \mathcal{E}$, and the attack then searches the low-dimensional embedding (latent) space for perturbations that transfer effectively to unknown targets. Efficient NES (Natural Evolution Strategies) gradient estimates in the latent space drive the black-box attack (Huang et al., 2019).
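An illustrative sketch of NES gradient estimation in the latent space, assuming a `decoder` that maps latent codes to perturbed inputs and a scalar `query_loss` that queries the black-box target; the antithetic-sampling estimator below is the standard NES recipe, not code from the cited work.

```python
import torch

def nes_latent_step(z, decoder, query_loss, sigma=0.1, pop=50, lr=0.05):
    """One NES ascent step on a latent code z using black-box queries only."""
    grad_est = torch.zeros_like(z)
    for _ in range(pop // 2):
        u = torch.randn_like(z)
        # Antithetic sampling: evaluate +u and -u, weight by the loss values.
        l_pos = query_loss(decoder(z + sigma * u))
        l_neg = query_loss(decoder(z - sigma * u))
        grad_est += (l_pos - l_neg) * u
    grad_est /= (pop * sigma)
    return z + lr * grad_est.sign()        # ascend the estimated gradient
```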

3. Attacked Modalities and Target Systems

Embedding-space adversarial attacks have been broadly applied to:

  • NLP models: perturb word or token embeddings to induce misclassification or reveal memorized/unlearned content (Yang et al., 23 Jan 2024, Schwinn et al., 14 Feb 2024, Chattopadhyay et al., 3 Apr 2024).
  • Multimodal models: undermine the alignment of pre-trained encoders across modalities, e.g., image-text or video-text, by crafting continuous or input-space perturbations that break semantic correspondence (Salman et al., 1 Jul 2024, Ye et al., 2023, Shayegani et al., 2023).
  • Dense retrieval systems: poison the corpus by injecting adversarially perturbed documents whose embeddings mimic those of real targets to hijack high retrieval ranks (Li et al., 24 Apr 2025).
  • Speaker verification and biometrics: use embedding- or feature-space FGSM to degrade identity verification, with high transferability across embedding types (Li et al., 2019).
  • Graph/network embedding models: flip graph edges to maximally disrupt a target node's embedding in community detection and classification, leveraging GCN loss gradients (Chen et al., 2018).
  • Face recognition and computer vision: explicitly attack (and defend) the penultimate representation layer (embedding space), with triplet margin losses to improve robustness (Zhong et al., 2019).

4. Empirical Findings and Attack Success

Embedding-space attacks routinely demonstrate high success rates and transferability:

  • In multimodal settings, visually indistinguishable image perturbations can be constructed that force the model to interpret the image as any target text, with reported 100% success under tight $\ell_2$ constraints (mean $\ell_2$ distortion $< 1.0$) (Salman et al., 1 Jul 2024). Qualitative results show semantic alignment can be broken arbitrarily with no perceptible change to the input.
  • In LLMs, soft prompt embedding attacks bypass state-of-the-art safety alignment; universal perturbations trained on a fraction of behaviors retain ≥ 95% cumulative success on unseen instructions (Schwinn et al., 14 Feb 2024). Embedding poisoning at deployment time achieves a 96.43% mean attack success rate (ASR) across six aligned LLMs while preserving benign utility and evading detection (Yuan et al., 8 Sep 2025).
  • In DNN image classifiers and retrieval systems, TREMBA and analogous latent-space search methods cut query budgets by 2–6× on black-box targets while boosting attack transfer rates over previous baselines (Huang et al., 2019).

5. Defenses and Robustness Strategies

Adversarial defenses for embedding-space attacks fall into several categories:

  • Embedding-Space Adversarial Training: Exposing the model during training to continuous-embedding perturbations (using efficient, sometimes single-step, attacks) consistently increases robustness not only to embedding-space, but also to discrete-space attacks (e.g., synonym or token-level), and is scalable to large models (Yang et al., 23 Jan 2024, Xhonneux et al., 24 May 2024). Margin-based triplet embedding regularizations also enhance robustness by enforcing a margin separation in the embedding space between true and impostor classes (Zhong et al., 2019).
  • Ensembles with Varied Embedding Dimensionality: Constructing ensembles of models with diverse input embedding dimensions exploits the observed lack of transferability across dimension-mismatched models (Chattopadhyay et al., 3 Apr 2024).
  • Bi-level and Adversarial Removal: Approaches such as BEEAR construct and neutralize universal embedding drifts associated with backdoors or triggers without knowing their discrete form, by adversarially fine-tuning to negate the effect of identified embedding perturbations (Zeng et al., 24 Jun 2024).
  • Embedding-space Anomaly Detection and Smoothing: Defense strategies include comparing incoming embedding vectors to nearest-neighbor embeddings, purifying off-manifold points, and randomizing local regions around input embeddings to disrupt adversarial effectiveness (Yuan et al., 8 Sep 2025).

Despite these advances, robustness to embedding-space attacks remains a critical challenge: adversarial training scales well to continuous perturbations, yet detecting and defending against stealthy, one-dimensional embedding poisons injected at deployment time remain largely unsolved.
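For concreteness, a minimal sketch of embedding-space adversarial training in the spirit of the first defense category above, reusing the hypothetical `embed_fn`/`forward_fn` interface from Section 2; it pairs a single-step embedding attack with an ordinary outer update and is not any specific paper's recipe.

```python
import torch
import torch.nn.functional as F

def adversarial_train_step(embed_fn, forward_fn, optimizer,
                           input_ids, labels, alpha=0.1):
    """One training step: single-step embedding attack + outer parameter update."""
    emb = embed_fn(input_ids)                          # (batch, seq_len, d)

    # Inner maximization: one normalized-gradient step on the perturbation.
    delta = torch.zeros_like(emb, requires_grad=True)
    inner_loss = F.cross_entropy(forward_fn(emb.detach() + delta), labels)
    grad, = torch.autograd.grad(inner_loss, delta)
    delta = alpha * grad / (grad.norm() + 1e-12)       # fixed perturbation, no grad

    # Outer minimization: ordinary update on the adversarially perturbed embeddings.
    optimizer.zero_grad()
    loss = F.cross_entropy(forward_fn(emb + delta), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```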

6. Theoretical Insights, Vulnerabilities, and Open Challenges

Several theoretical and empirical insights underpin embedding-space vulnerabilities:

  • Representation Linearization: The embedding spaces of deep neural networks, in both vision and language, tend to be structured linearly or locally smooth, making them vulnerable to targeted input or hidden-state perturbations that direct the representation across decision boundaries (Zhong et al., 2019, Li et al., 2019).
  • Cross-Modality and Transferability: Joint embedding spaces in multimodal models are exceedingly flexible, permitting "misalignment at will" via tiny perturbations (Salman et al., 1 Jul 2024). Embedding-poisoning attacks can be model-agnostic and transfer across architectures because they exploit fundamental representational weaknesses, not just classifier-centric decision rules (Yuan et al., 8 Sep 2025).
  • Dimensionality Sensitivity: In text classifiers, even small mismatches in input embedding dimension sharply limit the transfer of attacks, unlike the high-dimensional pixel spaces of vision (Chattopadhyay et al., 3 Apr 2024).
  • Attack Efficiency: Embedding-space attacks are highly computationally efficient, as the search is both differentiable and unconstrained by combinatorial discrete input spaces. Continuous PGD- or FGSM-based adversarial training reduces compute by $2$–$3$ orders of magnitude compared to discrete token-level alternatives (Xhonneux et al., 24 May 2024).
  • Detection Failure and Stealth: Embedding-level perturbations are not observable in input tokens or model weights, and thus evade current safety scans and integrity checks. Detection of such attacks—especially those operating by a single coordinate or token embedding shift—remains an open defense problem (Yuan et al., 8 Sep 2025).

7. Summary Table: Representative Papers and Domains

Domain | Attack Type / Paper | Key Results / Insights
NLP | FAT, C-AdvUL, Triplet Reg., Dim. Ens. (Yang et al., 23 Jan 2024; Xhonneux et al., 24 May 2024; Zhong et al., 2019; Chattopadhyay et al., 3 Apr 2024) | Robustness boost to discrete & embedding attacks; ensemble transfer gap
Multimodal | CLIP/Image-Text Collisions (Salman et al., 1 Jul 2024; Shayegani et al., 2023; Ye et al., 2023) | 100% attack success, alignment fully broken
Retrieval | Corpus Poisoning (Li et al., 24 Apr 2025) | Fast, high-ASR, low-perplexity, undetectable adversarial docs
Speech | i-vector/x-vector (Li et al., 2019) | High FAR/EER; transferability to neural embeddings
Graph/Network | Fast Grad. Attack (Chen et al., 2018) | State-of-the-art misclassification via edge flips
LLMs/Alignment | Embedding Poisoning, BEEAR (Yuan et al., 8 Sep 2025; Zeng et al., 24 Jun 2024; Schwinn et al., 14 Feb 2024) | 96%+ attack success; guardrail evasion at minimal cost

This body of work demonstrates that embedding-space-based adversarial attacks constitute a powerful, general, and cross-domain threat, with attack and defense paradigms that frequently diverge from those developed for discrete input manipulation. As embedding spaces form the backbone of nearly all foundational models, their integrity and robustness will remain a central concern in the ongoing evolution of secure, reliable AI systems.
