Universal Adversarial Prefix in Robotic Policies
- The paper demonstrates a universal adversarial prefix that misleads language-conditioned robotic policies across diverse tasks.
- It employs a composite loss function combining continuous action and self-attention feature losses to optimize discrete token sequences.
- Experimental results on the 200M-parameter VIMA model reveal significantly higher attack success rates compared to baseline methods.
A universal adversarial prefix is a short, fixed sequence of discrete tokens that, when prepended to any language (or multimodal) prompt, reliably induces a language-conditioned robotic policy to execute unintended, and frequently incorrect, actions. In the context of language-conditioned robotic learning—where models such as VIMA translate image and text prompts into sequences of executable robot actions—the universal adversarial prefix demonstrates a systemic vulnerability. This class of attack is “universal” in that a single optimized prefix is effective across a broad distribution of tasks, prompts, and environments, rather than being tailored to a specific input.
1. Mathematical Formulation
Let $\pi$ denote a language-conditioned robot policy, where $\pi$ maps a multimodal prompt $x$ and history $h$ to a sequence of discrete robot actions. The universal adversarial prefix, denoted $\delta = (\delta_1, \dots, \delta_m)$ with $\delta_i \in \mathcal{V}$ (the vocabulary), takes the form of a token sequence prepended to any legitimate prompt $x$—resulting in the composed prompt $\delta \oplus x$. Each $\delta_i$ is one-hot encoded as $e_{\delta_i} \in \{0, 1\}^{|\mathcal{V}|}$, and embedded into model space by the token embedding matrix $W_E \in \mathbb{R}^{|\mathcal{V}| \times d}$, yielding the prefix embeddings $e_{\delta_i} W_E$.
The adversary’s objective is to identify a $\delta$ that, when applied to any $x$, reduces the probability that $\pi$ outputs the correct action sequence $a^\ast$. For untargeted attacks, this is formalized by maximizing the (feature-based) loss function over a distribution $\mathcal{D}$ of prompt/history pairs,

$$\max_{\delta \in \mathcal{V}^m} \; \mathbb{E}_{(x, h) \sim \mathcal{D}} \left[ \alpha\, \mathcal{L}_{\mathrm{act}}(\delta \oplus x, h) + \beta\, \mathcal{L}_{\mathrm{attn}}(\delta \oplus x, h) \right],$$

where $\mathcal{L}_{\mathrm{act}}$ and $\mathcal{L}_{\mathrm{attn}}$ are the continuous action and self-attention feature losses, respectively, and $\alpha, \beta$ are weighting hyperparameters. Constraints include the prefix length $m$ and, optionally, embedding norm bounds.
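To make the setup concrete, the following is a minimal PyTorch sketch of composing a prefix with a prompt and embedding it through the one-hot representation; the vocabulary size, embedding width, and prefix length are illustrative assumptions, not VIMA's actual configuration.

```python
import torch

# Hypothetical dimensions for illustration; VIMA's actual vocabulary
# size, embedding width, and prefix length may differ.
VOCAB_SIZE, EMBED_DIM, PREFIX_LEN = 32_000, 768, 10

embedding = torch.nn.Embedding(VOCAB_SIZE, EMBED_DIM)  # token embedding matrix W_E

def compose_prompt(prefix_ids: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
    """Prepend the adversarial prefix delta to a legitimate prompt x."""
    return torch.cat([prefix_ids, prompt_ids], dim=-1)

# One-hot view of the prefix: keeping gradients on the one-hot coordinates
# is what later enables token-level candidate selection.
prefix_ids = torch.randint(0, VOCAB_SIZE, (PREFIX_LEN,))
one_hot = torch.nn.functional.one_hot(prefix_ids, VOCAB_SIZE).float()
one_hot.requires_grad_(True)
prefix_embeds = one_hot @ embedding.weight  # rows e_{delta_i} W_E
```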
2. Continuous Action Representation
Robotics models like VIMA convert input prompts into actions via a two-stage process: a controller $f$ maps continuous embeddings of the prompt $x$ (together with history $h$) to a continuous action vector $f(x, h) \in \mathbb{R}^{d_a}$, and a decoder $g$ discretizes this output into executable action tokens. The overall policy is then $\pi(x, h) = g(f(x, h))$.
Adversarial attacks that operate purely on discrete action outputs are largely ineffective, as the discretization step in $g$ induces robustness. The proposed method circumvents this by defining the adversarial loss in the continuous action space:

$$\mathcal{L}_{\mathrm{act}}(\delta \oplus x, h) = \left\| f(\delta \oplus x, h) - f(x, h) \right\|_2^2.$$

Maximizing this loss exploits the differentiability of $f$, enabling gradient-based optimization with respect to $\delta$. No gradients are propagated through $g$, thus sidestepping its nondifferentiability.
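As a minimal sketch of this idea (assuming a PyTorch-style differentiable controller `f` exposed separately from the discretizer; names are hypothetical):

```python
import torch

def continuous_action_loss(f, adv_embeds, clean_embeds, history):
    """Untargeted loss in continuous action space: push the adversarial
    continuous action f(delta + x, h) away from the clean one f(x, h).
    The discretizer g is never called, so no gradient must pass through it."""
    a_clean = f(clean_embeds, history).detach()  # treat the clean action as a constant
    a_adv = f(adv_embeds, history)               # differentiable w.r.t. the prefix
    return torch.sum((a_adv - a_clean) ** 2)     # maximized during the attack
```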
3. Exploiting Intermediate Self-Attention Features
Beyond manipulating the final action layer, the method incorporates losses based on intermediate self-attention activations. Let $s_\ell(x, h)$ denote the output of the $\ell$-th self-attention layer of the decoder, pooled or flattened and aggregated across heads; a summary vector $s(x, h)$ is constructed by concatenating or averaging these. The self-attention feature loss is:

$$\mathcal{L}_{\mathrm{attn}}(\delta \oplus x, h) = \left\| s(\delta \oplus x, h) - s(x, h) \right\|_2^2.$$

The total prefix loss is the weighted sum:

$$\mathcal{L}(\delta \oplus x, h) = \alpha\, \mathcal{L}_{\mathrm{act}}(\delta \oplus x, h) + \beta\, \mathcal{L}_{\mathrm{attn}}(\delta \oplus x, h).$$

Gradients of this composite loss with respect to the prefix tokens $\delta$ are computed and used to guide token-level substitutions, amplifying adversarial efficacy.
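A hedged sketch of the composite loss, assuming the self-attention summaries are collected via forward hooks and pooled into fixed-size tensors (the `alpha=1.0, beta=20.0` defaults mirror the best-performing setting reported in the experiments below):

```python
import torch

def composite_loss(f_adv, f_clean, s_adv, s_clean, alpha=1.0, beta=20.0):
    """Weighted sum of the continuous-action loss and the self-attention
    feature loss; all four inputs are assumed to be tensors computed on the
    same prompt/history pair, with and without the adversarial prefix."""
    l_act = torch.sum((f_adv - f_clean.detach()) ** 2)
    l_attn = torch.sum((s_adv - s_clean.detach()) ** 2)
    return alpha * l_act + beta * l_attn
```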
4. Optimization: Adversarial Distillation via Greedy Coordinate Gradient
A Greedy Coordinate Gradient (GCG) algorithm is applied to optimize $\delta$ in the discrete token space:
- Initialize $\delta$ with random tokens.
- For each iteration:
  - For a batch of prompt/history pairs $(x, h)$, compute clean and adversarial continuous actions and self-attention features: $f(x, h)$, $f(\delta \oplus x, h)$, $s(x, h)$, and $s(\delta \oplus x, h)$.
  - Compute the batch loss as the average of $\alpha\, \mathcal{L}_{\mathrm{act}} + \beta\, \mathcal{L}_{\mathrm{attn}}$ over the batch.
  - Backpropagate to obtain gradients with respect to each prefix token embedding.
  - For each token position $i$, compute the gradient with respect to the one-hot encoding, $\nabla_{e_{\delta_i}} \mathcal{L}$, and identify the top-$k$ candidate replacements in the vocabulary $\mathcal{V}$ (those whose substitution is predicted, to first order, to increase the loss).
  - Randomly select a token position and substitute its value with one of its top-$k$ candidates, keeping the replacement only if it increases the loss.
- Terminate after a fixed number of steps or upon loss convergence.
This algorithm ensures tractable optimization over the discrete token space, directly leveraging gradients from the continuous action and self-attention features to misalign internal and output representations. The stopping criterion is commonly a fixed step budget or empirical convergence of the loss; a minimal sketch of the full loop follows.
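In the sketch below, `loss_from_one_hot` is a hypothetical callable that embeds a one-hot prefix, runs the frozen policy on a batch of prompt/history pairs, and returns the composite loss; the step budget and `top_k` are illustrative values, not the paper's settings.

```python
import torch

def gcg_attack(loss_from_one_hot, vocab_size, prefix_len=10, steps=500, top_k=256):
    """Greedy Coordinate Gradient over discrete prefix tokens (maximizing loss)."""
    prefix = torch.randint(0, vocab_size, (prefix_len,))
    for _ in range(steps):
        one_hot = torch.nn.functional.one_hot(prefix, vocab_size).float()
        one_hot.requires_grad_(True)
        loss = loss_from_one_hot(one_hot)
        loss.backward()
        # Since the loss is maximized, candidates at each position are the
        # tokens with the largest gradient on the one-hot coordinates.
        candidates = one_hot.grad.topk(top_k, dim=-1).indices  # (prefix_len, top_k)
        pos = torch.randint(0, prefix_len, (1,)).item()
        trial = prefix.clone()
        trial[pos] = candidates[pos, torch.randint(0, top_k, (1,)).item()]
        with torch.no_grad():
            trial_loss = loss_from_one_hot(
                torch.nn.functional.one_hot(trial, vocab_size).float())
        if trial_loss > loss.detach():  # greedy: keep only improving swaps
            prefix = trial
    return prefix
```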
5. Experimental Protocol and Results
Experiments target the 200M-parameter VIMA model, configured with dual Mask R-CNN plus ViT vision encoders, a T5-based multimodal encoder, and a transformer decoder, evaluated across 13 Level-1 tasks from the VIMA-Bench benchmark spanning manipulation, scene understanding, object attribute generalization, and spatial reasoning.
The primary metric is untargeted attack success rate (ASR): the proportion of demonstrations on which the robot fails its designated task within the step allowance. Evaluation is averaged over 150 test instances per task.
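As a small illustration of the metric (a hypothetical helper; an outcome records whether the robot completed its task under attack):

```python
def attack_success_rate(outcomes):
    """Untargeted ASR: fraction of episodes in which the attacked policy
    fails its designated task within the step allowance."""
    return sum(1 for success in outcomes if not success) / len(outcomes)
```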
Attack Performance Comparison
| Task (examples of the 13 evaluated) | Random Prefix | Baseline GCG | Baseline GD | M_GCG | Ours (α=1, β=20) |
|---|---|---|---|---|---|
| VM (Manipulation) | 0.7% | 32.4% | 0.4% | 53.8% | 81.8% |
| SU (Scene Understanding) | 0.9% | 24.7% | 0.4% | 26.7% | 75.1% |
| Ro (Rotate) | 0.4% | 26.7% | 0.2% | 80.2% | 63.8% |
| SS (Same Shape) | 7.8% | 77.3% | 4.4% | 88.9% | 98.4% |
| Average (all 13 tasks) | 20.8% | 35.3% | 19.0% | 39.6% | 47.1% |
Results show that the universal adversarial prefix method outperforms all baselines on average, exceeding the strongest baseline (M_GCG) by approximately 7.5 percentage points in mean ASR and achieving near-complete task failure on specific categories (98.4% on Same Shape).
Ablation studies show that combining the continuous action and self-attention feature losses is essential: the discrete action loss alone yields 34% ASR, the continuous loss alone yields roughly 51%, and injecting cross-attention and self-attention features raises performance by up to 22 percentage points over the continuous loss alone.
Transferability analysis demonstrates that a prefix optimized on the 200M-parameter VIMA model remains effective on a 92M-parameter variant, in some settings even improving ASR (e.g., 52% ASR with a 10-token prefix in the gray-box setting versus 33% white-box), reflecting strong cross-model generalization.
6. Context and Security Implications
The universal adversarial prefix exemplifies the vulnerability landscape in language-conditioned robotics, where input prompts can be subtly manipulated to produce broad task failures even in large, heavily trained multimodal architectures. The findings indicate that discretized action spaces alone do not confer safety: effective attacks can be constructed by targeting differentiable intermediates, particularly continuous action heads and self-attention representations.
This suggests a latent risk in deployment settings where prompts are modifiable, and highlights the importance of evaluating intermediate feature robustness in addition to final action outputs. A plausible implication is the need for new defense strategies operating at multiple levels in the perception–action pipeline, or for adversarial training that incorporates universal prefix attacks in the loop.
7. Significance and Future Directions
The adversarial distillation methodology—strategic optimization of discrete prefix tokens using feature-based gradients over both action and attention layers—marks an advance over token-level and softmax-based attack frameworks, with potential applicability in broader multimodal and language-model settings. Future research may focus on enlarging the supported prompt and task space, developing robust defense mechanisms, and analyzing transferability across diverse architectures and domains (Zhao et al., 2024).