
Universal Adversarial Prefix in Robotic Policies

Updated 10 December 2025
  • The paper demonstrates a universal adversarial prefix that misleads language-conditioned robotic policies across diverse tasks.
  • It employs a composite loss function combining continuous action and self-attention feature losses to optimize discrete token sequences.
  • Experimental results on the 200M-parameter VIMA model reveal significantly higher attack success rates compared to baseline methods.

A universal adversarial prefix is a short, fixed sequence of discrete tokens that, when prepended to any language (or multimodal) prompt, reliably induces a language-conditioned robotic policy to execute unintended, and frequently incorrect, actions. In the context of language-conditioned robotic learning—where models such as VIMA translate image and text prompts into sequences of executable robot actions—the universal adversarial prefix represents a demonstration of systemic vulnerability. This class of attack is “universal” in that a single optimized prefix is effective across a broad distribution of tasks, prompts, and environments, rather than being tailored for a specific input.

1. Mathematical Formulation

Let $\pi_\theta$ denote a language-conditioned robot policy, where $\pi_\theta(p, h)$ maps a multimodal prompt $p$ and history $h$ to a sequence of discrete robot actions. The universal adversarial prefix, denoted $\delta = [d_1, \dots, d_{\ell_\delta}]$ with $d_i \in V$ (the vocabulary), takes the form of a token sequence prepended to any legitimate prompt $p$, resulting in the composed prompt $\delta \oplus p$. Each $d_i$ is one-hot encoded as $e_{d_i} \in \{0,1\}^{|V|}$ and embedded into model space by the token embedding matrix $E_\text{tok}$, yielding $E(\delta) = [E_\text{tok}\, e_{d_1}, \ldots, E_\text{tok}\, e_{d_{\ell_\delta}}]$.
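As a concrete illustration, the construction of $E(\delta)$ from token ids can be sketched in NumPy. This is a toy sketch, not the paper's code: the vocabulary size, embedding dimension, prefix length, and all arrays are synthetic stand-ins.

```python
import numpy as np

# Toy values for |V|, embedding dim, and prefix length (assumptions, not VIMA's).
V, d_model, ell = 32, 8, 4
rng = np.random.default_rng(0)
E_tok = rng.standard_normal((d_model, V))       # token embedding matrix E_tok

delta = rng.integers(0, V, size=ell)            # prefix token ids d_1..d_ell
one_hot = np.eye(V)[delta]                      # e_{d_i} in {0,1}^|V|, shape (ell, |V|)
E_delta = one_hot @ E_tok.T                     # E(delta), shape (ell, d_model)

# Prepending delta to a prompt p is concatenation along the sequence axis.
E_p = rng.standard_normal((10, d_model))        # embeddings of a legitimate prompt p
E_composed = np.concatenate([E_delta, E_p], axis=0)   # E(delta ⊕ p)
```

Keeping the one-hot vectors explicit is what later allows gradients at the embedding layer to be mapped back to scores over discrete vocabulary entries.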

The adversary’s objective is to identify a $\delta$ that, when applied to any $p$, reduces the probability that $\pi_\theta$ outputs the correct action $a^*$. For untargeted attacks, this is formalized by maximizing the (feature-based) loss over a distribution $(p, h) \sim \mathcal{D}$:

$$\delta^* = \arg\max_{\delta : |\delta| \le \ell_\delta} \; \mathbb{E}_{(p,h) \sim \mathcal{D}} \big[ \alpha L_\text{continuous}(\delta; p, h) + \beta L_\text{self-attn}(\delta; p, h) \big]$$

where $L_\text{continuous}$ and $L_\text{self-attn}$ are the continuous-action and self-attention feature losses, respectively, and $\alpha, \beta > 0$ are weighting hyperparameters. Constraints include the prefix length $|\delta| \le \ell_\delta$ and, optionally, embedding norm bounds.

2. Continuous Action Representation

Robotics models like VIMA convert input prompts into actions via a two-stage process: a controller $D_c$ maps continuous embeddings $E(p)$ (with history $h$) to a continuous action vector $a_\text{cont}$, and a decoder $D_a$ discretizes this output. The overall policy is then $\pi_\theta(p, h) = D_a(D_c(E(p), h))$.
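A minimal sketch of this two-stage pipeline, with toy linear stand-ins for VIMA's actual controller and decoder (the dimensions, pooling step, and binning scheme below are illustrative assumptions, not the model's real components):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_action, n_bins = 8, 3, 11            # toy dimensions (assumptions)
W = rng.standard_normal((d_action, d_model))    # toy linear controller weights

def D_c(E_p, h):
    # Controller: pool the prompt embeddings, offset by history, map to a
    # continuous action vector (stand-in for VIMA's action head).
    return W @ (E_p.mean(axis=0) + h)

def D_a(a_cont):
    # Decoder: discretize each continuous dimension into one of n_bins bins
    # over [-3, 3] (an illustrative binning scheme, not VIMA's).
    return np.clip(np.round((a_cont + 3.0) / 6.0 * (n_bins - 1)), 0, n_bins - 1).astype(int)

E_p = rng.standard_normal((10, d_model))
h = np.zeros(d_model)
a_cont = D_c(E_p, h)            # continuous action
a_disc = D_a(a_cont)            # pi_theta(p, h) = D_a(D_c(E(p), h))
```

The rounding inside `D_a` is exactly the nondifferentiable step that motivates attacking `D_c`'s output instead.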

Adversarial attacks that operate purely on discrete action outputs are largely ineffective, as the discretization step in $D_a$ induces robustness. The proposed method circumvents this by defining the adversarial loss in the continuous action space:

$$L_\text{continuous}(\delta; p, h) = \| D_c(E(\delta \oplus p), h) - D_c(E(p), h) \|_2^2$$

Maximizing this loss exploits the differentiability of $D_c$, enabling gradient-based optimization with respect to $E(\delta)$. No gradients are propagated through $D_a$, thus sidestepping its nondifferentiability.
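In code, the continuous-action loss is a squared L2 distance between two controller outputs. The sketch below uses a hypothetical linear `D_c`; a real attack would call VIMA's controller and obtain gradients by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model = 8
W = rng.standard_normal((3, d_model))   # toy controller weights (assumption)

def D_c(E, h):
    # Stand-in for VIMA's differentiable controller.
    return W @ (E.mean(axis=0) + h)

def L_continuous(E_prefixed, E_clean, h):
    # Squared L2 distance between prefixed and clean continuous actions;
    # the attacker maximizes this to push the action away from the clean one.
    diff = D_c(E_prefixed, h) - D_c(E_clean, h)
    return float(diff @ diff)

E_p = rng.standard_normal((10, d_model))        # clean prompt embeddings
E_delta = rng.standard_normal((4, d_model))     # embedded prefix E(delta)
h = np.zeros(d_model)
loss = L_continuous(np.concatenate([E_delta, E_p]), E_p, h)
```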

3. Exploiting Intermediate Self-Attention Features

Beyond manipulating the final action layer, the method incorporates losses based on intermediate self-attention activations. Let $F_s^{(i)}(E(p), h)$ denote the output of the $i$-th self-attention layer of the decoder, pooled or flattened and aggregated across heads; a summary vector $F_s(E(p), h)$ is constructed by concatenating or averaging these. The self-attention feature loss is:

$$L_\text{self-attn}(\delta; p, h) = \| F_s(E(\delta \oplus p), h) - F_s(E(p), h) \|_2^2$$
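A toy sketch of this feature loss, using a single-head self-attention layer as a stand-in for one decoder layer and mean-pooling as the summary operation (all weights and dimensions are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8                                     # toy model width (assumption)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def self_attn(E):
    # Single-head self-attention layer, standing in for the decoder's i-th layer.
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)    # row-wise softmax
    return A @ V

def F_s(E):
    return self_attn(E).mean(axis=0)      # pooled summary vector

def L_self_attn(E_prefixed, E_clean):
    # Deviation of pooled self-attention features, maximized by the attacker.
    diff = F_s(E_prefixed) - F_s(E_clean)
    return float(diff @ diff)

E_p = rng.standard_normal((10, d))
E_delta = rng.standard_normal((4, d))
loss = L_self_attn(np.concatenate([E_delta, E_p]), E_p)
```

In a real framework, the activations for `F_s` would typically be captured from the trained model's layers (e.g., via forward hooks) rather than recomputed.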

The total prefix loss is the weighted sum:

$$L(\delta; p, h) = \alpha L_\text{continuous}(\delta; p, h) + \beta L_\text{self-attn}(\delta; p, h)$$

Gradients of this composite loss with respect to $E(\delta)$ are computed and used to guide token-level substitutions, amplifying adversarial efficacy.

4. Optimization: Adversarial Distillation via Greedy Coordinate Gradient

A Greedy Coordinate Gradient (GCG) algorithm is applied to optimize $\delta$ in the token space:

  1. Initialize $\delta$ with random tokens.
  2. For each iteration:
    • For a batch of prompt/history pairs, compute clean and adversarial continuous actions and self-attention features: $a^0_i$, $F^0_i$, $a^\delta_i$, $F^\delta_i$.
    • Compute the batch loss as the average of $\alpha L_\text{continuous} + \beta L_\text{self-attn}$.
    • Backpropagate to obtain gradients with respect to each prefix token embedding.
    • For each token position $j$, compute the gradient with respect to the one-hot encoding $e_{d_j}$, i.e. $g_j \approx E_\text{tok}^\top \, \partial L / \partial (E_\text{tok}\, e_{d_j})$, and identify the top-$k$ candidate replacements in the vocabulary $V$ (those with the largest inner product with $g_j$, i.e. the largest first-order loss increase).
    • Randomly select a token position and substitute its value with one of its top-$k$ candidates, keeping the replacement only if it increases the loss.
  3. Terminate after a fixed number of steps or on loss convergence.

This algorithm ensures tractable optimization over the discrete token space, directly leveraging gradients from the continuous-action and self-attention features to misalign internal and output representations. The stopping criterion is commonly a fixed step budget (e.g., $T = 300$) or empirical convergence of the loss.
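The loop above can be sketched end-to-end with a toy differentiable objective standing in for the composite loss. All names, dimensions, and the analytic gradient are illustrative assumptions; a real attack would backpropagate through the VIMA controller and attention layers instead.

```python
import numpy as np

rng = np.random.default_rng(4)
V, d, ell, top_k, steps = 64, 8, 6, 8, 200
E_tok = rng.standard_normal((d, V))   # token embedding matrix
W = rng.standard_normal((3, d))       # toy readout standing in for the attacked features

def loss_fn(delta):
    # Toy stand-in for alpha*L_continuous + beta*L_self-attn, as a function
    # of the prefix tokens only (clean-prompt terms folded into the constant W).
    m = E_tok[:, delta].mean(axis=1)
    y = W @ m
    return float(y @ y)

def grads_wrt_embeddings(delta):
    # Analytic gradient dL/de_j of the toy loss at each prefix position j
    # (a real attack would obtain this by backpropagation).
    m = E_tok[:, delta].mean(axis=1)
    g_m = 2.0 * W.T @ (W @ m)
    return np.tile(g_m / ell, (ell, 1))

delta = rng.integers(0, V, size=ell)      # step 1: random initialization
init_loss = loss_fn(delta)
for _ in range(steps):                    # step 2: greedy coordinate updates
    g = grads_wrt_embeddings(delta)
    j = int(rng.integers(ell))            # random prefix position
    scores = E_tok.T @ g[j]               # first-order loss change per candidate token
    candidates = np.argsort(scores)[-top_k:]
    best_tok, best_loss = delta[j], loss_fn(delta)
    for v in candidates:                  # keep a substitution only if loss increases
        trial = delta.copy()
        trial[j] = v
        if loss_fn(trial) > best_loss:
            best_tok, best_loss = v, loss_fn(trial)
    delta[j] = best_tok
```

Because substitutions are accepted only when the loss increases, the loop is monotonically non-decreasing in the objective, which is what makes the greedy search stable despite the discrete token space.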

5. Experimental Protocol and Results

Experiments target the 200M-parameter VIMA model, configured with dual Mask R-CNN and ViT vision encoders, a T5-based multimodal encoder, and a transformer decoder, evaluated across 13 Level-1 tasks from the VIMA-Bench benchmark spanning manipulation, scene understanding, object-attribute generalization, and spatial reasoning.

The primary metric is untargeted attack success rate (ASR): the proportion of demonstrations on which the robot fails its designated task within the step allowance. Evaluation is averaged over 150 test instances per task.
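The metric itself is a simple failure fraction; a sketch with hypothetical episode outcomes (real evaluation rolls out the policy in VIMA-Bench):

```python
# Hypothetical outcomes; real evaluation runs the attacked policy in simulation.
def attack_success_rate(task_failed):
    """Untargeted ASR: fraction of test episodes where the task fails."""
    return sum(task_failed) / len(task_failed)

outcomes = [True] * 120 + [False] * 30   # 150 episodes per task, as in the protocol
asr = attack_success_rate(outcomes)      # 0.8, i.e. 80% ASR
```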

Attack Performance Comparison

Task (examples)             Random Prefix   Baseline GCG   Baseline GD   M_GCG   Ours (α=1, β=20)
VM (Manipulation)                    0.7%          32.4%          0.4%   53.8%              81.8%
SU (Scene Understanding)             0.9%          24.7%          0.4%   26.7%              75.1%
Ro (Rotate)                          0.4%          26.7%          0.2%   80.2%              63.8%
SS (Same Shape)                      7.8%          77.3%          4.4%   88.9%              98.4%
Average                             20.8%          35.3%         19.0%   39.6%              47.1%

Results show that the universal adversarial prefix method significantly outperforms all baselines, achieving both the highest average ASR (approximately 7.5 percentage points above the strongest baseline, M_GCG) and near-complete task failure on specific categories ("Same Shape", 98.4%).

Ablation studies illustrate that the combined use of continuous and self-attention features is essential: using only discrete loss yields 34% ASR, continuous loss alone yields ~51%, and injecting cross-attention and self-attention features raises performance by up to 22 percentage points over continuous alone.

Transferability analysis demonstrates that a prefix $\delta^*$ trained on the 200M-parameter VIMA model remains effective on a 92M-parameter variant, in some settings even improving ASR (e.g., 52% ASR at 10 tokens in the gray-box setting, versus 33% white-box), reflecting strong cross-model generalization.

6. Context and Security Implications

The universal adversarial prefix exemplifies the vulnerability landscape in language-conditioned robotics, where input prompts can be subtly manipulated to produce broad task failures even by large, multimodal, and heavily trained architectures. The findings indicate that discretized action spaces alone do not confer safety, and effective attacks can be constructed by targeting differentiable intermediates—particularly continuous action heads and self-attention representations.

This suggests a latent risk in deployment settings where prompts are modifiable, and highlights the importance of evaluating intermediate feature robustness in addition to final action outputs. A plausible implication is the need for new defense strategies operating at multiple levels in the perception–action pipeline, or for adversarial training that incorporates universal prefix attacks in the loop.

7. Significance and Future Directions

The adversarial distillation methodology—strategic optimization of discrete prefix tokens using feature-based gradients over both action and attention layers—marks an advance over token-level and softmax-based attack frameworks, with potential applicability in broader multimodal and language-model settings. Future research may focus on enlarging the supported prompt and task space, developing robust defense mechanisms, and analyzing transferability across diverse architectures and domains (Zhao et al., 2024).
