
Universal Adversarial Prefix in Robotic Policies

Updated 10 December 2025
  • The paper demonstrates a universal adversarial prefix that misleads language-conditioned robotic policies across diverse tasks.
  • It employs a composite loss function combining continuous action and self-attention feature losses to optimize discrete token sequences.
  • Experimental results on the 200M-parameter VIMA model reveal significantly higher attack success rates compared to baseline methods.

A universal adversarial prefix is a short, fixed sequence of discrete tokens that, when prepended to any language (or multimodal) prompt, reliably induces a language-conditioned robotic policy to execute unintended, and frequently incorrect, actions. In the context of language-conditioned robotic learning—where models such as VIMA translate image and text prompts into sequences of executable robot actions—the universal adversarial prefix represents a demonstration of systemic vulnerability. This class of attack is “universal” in that a single optimized prefix is effective across a broad distribution of tasks, prompts, and environments, rather than being tailored for a specific input.

1. Mathematical Formulation

Let $\pi_\theta$ denote a language-conditioned robot policy, where $\pi_\theta(p, h)$ maps a multimodal prompt $p$ and history $h$ to a sequence of discrete robot actions. The universal adversarial prefix, denoted $\delta = [d_1, \dots, d_{\ell_\delta}]$ with $d_i \in V$ (the vocabulary), takes the form of a token sequence prepended to any legitimate prompt $p$, resulting in the composed prompt $\delta \oplus p$. Each $d_i$ is one-hot encoded as $e_{d_i} \in \{0,1\}^{|V|}$ and embedded into model space by the token embedding matrix $E_\text{tok}$, yielding $E(\delta) = [E_\text{tok}\, e_{d_1}, \ldots, E_\text{tok}\, e_{d_{\ell_\delta}}]$.
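As a concrete illustration, the construction of $E(\delta)$ from token ids can be sketched in NumPy. This is a toy sketch, not the paper's code: the vocabulary size, embedding dimension, prefix length, and all arrays are synthetic stand-ins.

```python
import numpy as np

# Toy values for |V|, embedding dim, and prefix length (assumptions, not VIMA's).
V, d_model, ell = 32, 8, 4
rng = np.random.default_rng(0)
E_tok = rng.standard_normal((d_model, V))       # token embedding matrix E_tok

delta = rng.integers(0, V, size=ell)            # prefix token ids d_1..d_ell
one_hot = np.eye(V)[delta]                      # e_{d_i} in {0,1}^|V|, shape (ell, |V|)
E_delta = one_hot @ E_tok.T                     # E(delta), shape (ell, d_model)

# Prepending delta to a prompt p is concatenation along the sequence axis.
E_p = rng.standard_normal((10, d_model))        # embeddings of a legitimate prompt p
E_composed = np.concatenate([E_delta, E_p], axis=0)   # E(delta ⊕ p)
```

Keeping the one-hot vectors explicit is what later allows gradients at the embedding layer to be mapped back to scores over discrete vocabulary entries.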

The adversary’s objective is to identify a $\delta$ that, when applied to any $p$, reduces the probability that $\pi_\theta$ outputs the correct action $a^*$. For untargeted attacks, this is formalized by maximizing the (feature-based) loss over a distribution $(p, h) \sim \mathcal{D}$:

$$\delta^* = \arg\max_{\delta : |\delta| \le \ell_\delta} \; \mathbb{E}_{(p,h) \sim \mathcal{D}} \big[ \alpha L_\text{continuous}(\delta; p, h) + \beta L_\text{self-attn}(\delta; p, h) \big]$$

where $L_\text{continuous}$ and $L_\text{self-attn}$ are the continuous-action and self-attention feature losses, respectively, and $\alpha, \beta > 0$ are weighting hyperparameters. Constraints include the prefix length $|\delta| \le \ell_\delta$ and, optionally, embedding norm bounds.

2. Continuous Action Representation

Robotics models like VIMA convert input prompts into actions via a two-stage process: a controller $D_c$ maps continuous embeddings $E(p)$ (with history $h$) to a continuous action vector $a_\text{cont}$, and a decoder $D_a$ discretizes this output. The overall policy is then $\pi_\theta(p, h) = D_a(D_c(E(p), h))$.
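A minimal sketch of this two-stage pipeline, with toy linear stand-ins for VIMA's actual controller and decoder (the dimensions, pooling step, and binning scheme below are illustrative assumptions, not the model's real components):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_action, n_bins = 8, 3, 11            # toy dimensions (assumptions)
W = rng.standard_normal((d_action, d_model))    # toy linear controller weights

def D_c(E_p, h):
    # Controller: pool the prompt embeddings, offset by history, map to a
    # continuous action vector (stand-in for VIMA's action head).
    return W @ (E_p.mean(axis=0) + h)

def D_a(a_cont):
    # Decoder: discretize each continuous dimension into one of n_bins bins
    # over [-3, 3] (an illustrative binning scheme, not VIMA's).
    return np.clip(np.round((a_cont + 3.0) / 6.0 * (n_bins - 1)), 0, n_bins - 1).astype(int)

E_p = rng.standard_normal((10, d_model))
h = np.zeros(d_model)
a_cont = D_c(E_p, h)            # continuous action
a_disc = D_a(a_cont)            # pi_theta(p, h) = D_a(D_c(E(p), h))
```

The rounding inside `D_a` is exactly the nondifferentiable step that motivates attacking `D_c`'s output instead.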

Adversarial attacks that operate purely on discrete action outputs are largely ineffective, as the discretization step in $D_a$ induces robustness. The proposed method circumvents this by defining the adversarial loss in the continuous action space:

$$L_\text{continuous}(\delta; p, h) = \| D_c(E(\delta \oplus p), h) - D_c(E(p), h) \|_2^2$$

Maximizing this loss exploits the differentiability of $D_c$, enabling gradient-based optimization with respect to $E(\delta)$. No gradients are propagated through $D_a$, thus sidestepping its nondifferentiability.
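In code, the continuous-action loss is a squared L2 distance between two controller outputs. The sketch below uses a hypothetical linear `D_c`; a real attack would call VIMA's controller and obtain gradients by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model = 8
W = rng.standard_normal((3, d_model))   # toy controller weights (assumption)

def D_c(E, h):
    # Stand-in for VIMA's differentiable controller.
    return W @ (E.mean(axis=0) + h)

def L_continuous(E_prefixed, E_clean, h):
    # Squared L2 distance between prefixed and clean continuous actions;
    # the attacker maximizes this to push the action away from the clean one.
    diff = D_c(E_prefixed, h) - D_c(E_clean, h)
    return float(diff @ diff)

E_p = rng.standard_normal((10, d_model))        # clean prompt embeddings
E_delta = rng.standard_normal((4, d_model))     # embedded prefix E(delta)
h = np.zeros(d_model)
loss = L_continuous(np.concatenate([E_delta, E_p]), E_p, h)
```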

3. Exploiting Intermediate Self-Attention Features

Beyond manipulating the final action layer, the method incorporates losses based on intermediate self-attention activations. Let $F_s^{(i)}(E(p), h)$ denote the output of the $i$-th self-attention layer of the decoder, pooled or flattened and aggregated across heads; a summary vector $F_s(E(p), h)$ is constructed by concatenating or averaging these. The self-attention feature loss is:

$$L_\text{self-attn}(\delta; p, h) = \| F_s(E(\delta \oplus p), h) - F_s(E(p), h) \|_2^2$$
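A toy sketch of this feature loss, using a single-head self-attention layer as a stand-in for one decoder layer and mean-pooling as the summary operation (all weights and dimensions are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8                                     # toy model width (assumption)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def self_attn(E):
    # Single-head self-attention layer, standing in for the decoder's i-th layer.
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)    # row-wise softmax
    return A @ V

def F_s(E):
    return self_attn(E).mean(axis=0)      # pooled summary vector

def L_self_attn(E_prefixed, E_clean):
    # Deviation of pooled self-attention features, maximized by the attacker.
    diff = F_s(E_prefixed) - F_s(E_clean)
    return float(diff @ diff)

E_p = rng.standard_normal((10, d))
E_delta = rng.standard_normal((4, d))
loss = L_self_attn(np.concatenate([E_delta, E_p]), E_p)
```

In a real framework, the activations for `F_s` would typically be captured from the trained model's layers (e.g., via forward hooks) rather than recomputed.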

The total prefix loss is the weighted sum:

$$L(\delta; p, h) = \alpha L_\text{continuous}(\delta; p, h) + \beta L_\text{self-attn}(\delta; p, h)$$

Gradients of this composite loss with respect to $E(\delta)$ are computed and used to guide token-level substitutions, amplifying adversarial efficacy.

4. Optimization: Adversarial Distillation via Greedy Coordinate Gradient

A Greedy Coordinate Gradient (GCG) algorithm is applied to optimize $\delta$ in the token space:

  1. Initialize $\delta$ with random tokens.
  2. For each iteration:
    • For a batch of prompt/history pairs, compute clean and adversarial continuous actions and self-attention features: $a^0_i$, $F^0_i$, $a^\delta_i$, $F^\delta_i$.
    • Compute the batch loss as the average of $\alpha L_\text{continuous} + \beta L_\text{self-attn}$.
    • Backpropagate to obtain gradients with respect to each prefix token embedding.
    • For each token position $j$, compute the gradient with respect to the one-hot encoding $e_{d_j}$, i.e. $g_j \approx E_\text{tok}^\top \, \partial L / \partial (E_\text{tok}\, e_{d_j})$, and identify the top-$k$ candidate replacements in the vocabulary $V$ (those with the largest inner product with $g_j$, i.e. the largest first-order loss increase).
    • Randomly select a token position and substitute its value with one of its top-$k$ candidates, keeping the replacement only if it increases the loss.
  3. Terminate after a fixed number of steps or on loss convergence.

This algorithm ensures tractable optimization over the discrete token space, directly leveraging gradients from the continuous-action and self-attention features to misalign internal and output representations. The stopping criterion is commonly a fixed step budget (e.g., $T = 300$) or empirical convergence of the loss.
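The loop above can be sketched end-to-end with a toy differentiable objective standing in for the composite loss. All names, dimensions, and the analytic gradient are illustrative assumptions; a real attack would backpropagate through the VIMA controller and attention layers instead.

```python
import numpy as np

rng = np.random.default_rng(4)
V, d, ell, top_k, steps = 64, 8, 6, 8, 200
E_tok = rng.standard_normal((d, V))   # token embedding matrix
W = rng.standard_normal((3, d))       # toy readout standing in for the attacked features

def loss_fn(delta):
    # Toy stand-in for alpha*L_continuous + beta*L_self-attn, as a function
    # of the prefix tokens only (clean-prompt terms folded into the constant W).
    m = E_tok[:, delta].mean(axis=1)
    y = W @ m
    return float(y @ y)

def grads_wrt_embeddings(delta):
    # Analytic gradient dL/de_j of the toy loss at each prefix position j
    # (a real attack would obtain this by backpropagation).
    m = E_tok[:, delta].mean(axis=1)
    g_m = 2.0 * W.T @ (W @ m)
    return np.tile(g_m / ell, (ell, 1))

delta = rng.integers(0, V, size=ell)      # step 1: random initialization
init_loss = loss_fn(delta)
for _ in range(steps):                    # step 2: greedy coordinate updates
    g = grads_wrt_embeddings(delta)
    j = int(rng.integers(ell))            # random prefix position
    scores = E_tok.T @ g[j]               # first-order loss change per candidate token
    candidates = np.argsort(scores)[-top_k:]
    best_tok, best_loss = delta[j], loss_fn(delta)
    for v in candidates:                  # keep a substitution only if loss increases
        trial = delta.copy()
        trial[j] = v
        if loss_fn(trial) > best_loss:
            best_tok, best_loss = v, loss_fn(trial)
    delta[j] = best_tok
```

Because substitutions are accepted only when the loss increases, the loop is monotonically non-decreasing in the objective, which is what makes the greedy search stable despite the discrete token space.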

5. Experimental Protocol and Results

Experiments target the 200M-parameter VIMA model, configured with dual Mask R-CNN and ViT vision encoders, a T5-based multimodal encoder, and a transformer decoder, evaluated across 13 Level-1 tasks from the VIMA-Bench benchmark spanning manipulation, scene understanding, object-attribute generalization, and spatial reasoning.

The primary metric is untargeted attack success rate (ASR): the proportion of demonstrations on which the robot fails its designated task within the step allowance. Evaluation is averaged over 150 test instances per task.
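The metric itself is a simple failure fraction; a sketch with hypothetical episode outcomes (real evaluation rolls out the policy in VIMA-Bench):

```python
# Hypothetical outcomes; real evaluation runs the attacked policy in simulation.
def attack_success_rate(task_failed):
    """Untargeted ASR: fraction of test episodes where the task fails."""
    return sum(task_failed) / len(task_failed)

outcomes = [True] * 120 + [False] * 30   # 150 episodes per task, as in the protocol
asr = attack_success_rate(outcomes)      # 0.8, i.e. 80% ASR
```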

Attack Performance Comparison

Task (examples)             Random Prefix   Baseline GCG   Baseline GD   M_GCG   Ours (α=1, β=20)
VM (Manipulation)                    0.7%          32.4%          0.4%   53.8%              81.8%
SU (Scene Understanding)             0.9%          24.7%          0.4%   26.7%              75.1%
Ro (Rotate)                          0.4%          26.7%          0.2%   80.2%              63.8%
SS (Same Shape)                      7.8%          77.3%          4.4%   88.9%              98.4%
Average                             20.8%          35.3%         19.0%   39.6%              47.1%

Results show that the universal adversarial prefix method significantly outperforms all baselines, achieving both the highest average ASR (approximately 7.5 percentage points above the strongest baseline, M_GCG) and near-complete task failure on specific categories ("Same Shape", 98.4%).

Ablation studies illustrate that the combined use of continuous and self-attention features is essential: using only discrete loss yields 34% ASR, continuous loss alone yields ~51%, and injecting cross-attention and self-attention features raises performance by up to 22 percentage points over continuous alone.

Transferability analysis demonstrates that a prefix $\delta^*$ trained on the 200M-parameter VIMA model remains effective on a 92M-parameter variant, in some settings even improving ASR (e.g., 52% ASR at 10 tokens in the gray-box setting, versus 33% white-box), reflecting strong cross-model generalization.

6. Context and Security Implications

The universal adversarial prefix exemplifies the vulnerability landscape in language-conditioned robotics, where input prompts can be subtly manipulated to produce broad task failures even by large, multimodal, and heavily trained architectures. The findings indicate that discretized action spaces alone do not confer safety, and effective attacks can be constructed by targeting differentiable intermediates—particularly continuous action heads and self-attention representations.

This suggests a latent risk in deployment settings where prompts are modifiable, and highlights the importance of evaluating intermediate feature robustness in addition to final action outputs. A plausible implication is the need for new defense strategies operating at multiple levels in the perception–action pipeline, or for adversarial training that incorporates universal prefix attacks in the loop.

7. Significance and Future Directions

The adversarial distillation methodology—strategic optimization of discrete prefix tokens using feature-based gradients over both action and attention layers—marks an advance over token-level and softmax-based attack frameworks, with potential applicability in broader multimodal and language-model settings. Future research may focus on enlarging the supported prompt and task space, developing robust defense mechanisms, and analyzing transferability across diverse architectures and domains (Zhao et al., 2024).
