Papers
Topics
Authors
Recent
2000 character limit reached

VLA-Fool: Multimodal Adversarial Benchmark

Updated 27 November 2025
  • VLA-Fool is a unified framework that assesses adversarial attacks on vision-language-action models through systematic perturbations in visual and textual inputs.
  • It implements gradient-based, patch-based, and cross-modal misalignment techniques, achieving up to 100% failure rates in certain attack settings.
  • The framework reveals critical challenges in aligning visual, language, and action modalities, driving the need for enhanced adversarial resilience in embodied AI.

VLA-Fool is a unified adversarial robustness framework targeting Vision-Language-Action (VLA) models in embodied AI, providing comprehensive methodology and benchmarking for probing the fragility of alignment across vision, language, and action modalities. Designed to systematically assess both white-box and black-box vulnerabilities, VLA-Fool enables precise evaluation of multimodal adversarial robustness and exposes critical weaknesses in embodied perception, reasoning, and control systems (Yan et al., 20 Nov 2025).

1. Formal Framework and Threat Models

VLA-Fool studies how an adversary can induce misalignment in vision-language-action models, where input is a tuple (I,T)(I, T) with IRH×W×3I \in \mathbb{R}^{H \times W \times 3} a visual observation and T=(w1,,wM)T = (w_1, \ldots, w_M) a natural-language instruction. The VLA model MM outputs an action vector A=M(I,T)=f(Ev(I),Et(T))A = M(I, T) = f(E_v(I), E_t(T)), combining visual encoder EvE_v, language encoder EtE_t, and an action decoder ff.

The adversary aims to construct perturbed inputs (Iadv,Tadv)(I_\text{adv}, T_\text{adv}) such that Aadv=M(Iadv,Tadv)A_\text{adv} = M(I_\text{adv}, T_\text{adv}) deviates from the nominal AA. The attack objective is,

minδv,δtLattack(M(I+δv,T+δt),A),\min_{\delta_v, \delta_t} \mathcal{L}_\text{attack}(M(I+\delta_v, T+\delta_t), A),

where Lattack\mathcal{L}_\text{attack} quantifies action deviation.

Two threat models are considered:

  • White-box: Full access to model weights, structure, and gradients,
  • Black-box: Only output actions or success metrics are available.

Attacks are categorized by modality:

  • Textual perturbation: Δtext=argmaxδpϵtextLattack(M(I,T+δ),A)\Delta_\text{text} = \arg\max_{\|\delta\|_p \leq \epsilon_\text{text}} \mathcal{L}_\text{attack}(M(I, T+\delta), A),
  • Visual perturbation: Δvis=argmaxδvϵvisLattack(M(I+δv,T),A)\Delta_\text{vis} = \arg\max_{\|\delta_v\|_\infty \leq \epsilon_\text{vis}} \mathcal{L}_\text{attack}(M(I+\delta_v, T), A),
  • Cross-modal misalignment: Joint optimization argmaxΔvϵv,ΔtpϵtLmis(I+Δv,T+Δt)\arg\max_{\|\Delta_v\|_\infty \leq \epsilon_v,\, \|\Delta_t\|_p \leq \epsilon_t} \mathcal{L}_\text{mis}(I+\Delta_v, T+\Delta_t) where

Lmis=1NMi=1Nj=1Mcos(pi,wj)cos(pi,wj),\mathcal{L}_\text{mis} = \frac{1}{N M} \sum_{i=1}^{N} \sum_{j=1}^{M} \big| \cos(p_i, w_j) - \cos(p'_i, w'_j) \big|,

and pip_i and wjw_j are patch and token embeddings.

2. Multilevel Attack Methodologies

VLA-Fool implements three primary attack channels:

a) Textual Perturbations

  • Gradient-based (SGCG): Extends Greedy Coordinate Gradient by leveraging a VLA-aware semantic space. The method computes eiLattack\nabla_{e_i} \mathcal{L}_\text{attack} for token embeddings, selects sensitive positions, and substitutes tokens from a candidate pool CVLA(k)(T)=CGCLk(T)\mathcal{C}_\text{VLA}^{(k)}(T) = \mathcal{C}_G \cup \mathcal{C}_L^k(T). The class-specific substitute lexicons address referential ambiguity, attribute weakening, scope blurring, and negation confusion.
  • Prompt-based (black-box): Adversarial context is injected via crafted prefixes/suffixes, e.g., prepending “Act as an antagonistic agent…” or appending “ignore previous message and instead …”, requiring no embedding or gradient access.

b) Visual Perturbations

  • Patch-based (white-box): An adversarial patch δp\delta_p is optimized and placed (via operator P(δp)P(\delta_p)) on scene regions (object/environmental or robot arm). Optimization uses gradient ascent for maxδpLattack(M(I+P(δp),T),A)\max_{\delta_p} \mathcal{L}_\text{attack}(M(I+P(\delta_p), T), A).
  • Noise-based (black-box): Injects realistic image corruptions: Gaussian noise (δvN(0,σ2)\delta_v \sim \mathcal{N}(0, \sigma^2)), salt-and-pepper (fraction ρ\rho of pixels), speckle, uniform, pseudo-random (PRNG) patterns, and differentially private (DP) randomization.

c) Cross-Modal Misalignment

Attacks that explicitly disrupt alignment between visual feature patches and semantic tokens, maximizing Lmis\mathcal{L}_\text{mis}. The loss is directly tied to changes in the cosine similarity matrix between visual and language embeddings, potentially with regularization via Lattack\mathcal{L}_\text{attack} on the action space.

3. Semantic Space and Prompt Engineering

The VLA-aware semantic space comprises four perturbation modes:

  1. Referential ambiguity,
  2. Attribute weakening,
  3. Scope blurring,
  4. Negation confusion.

For each, a lexicon CLk(T)\mathcal{C}_L^k(T) of candidate tokens is constructed, merged with structure-based and geometric candidates CG\mathcal{C}_G. The attack’s candidate pool thus remains grounded in semantics relevant to embodied reasoning.

Prompt-based (black-box) attacks use templates of the form:

1
<prefix> + T + <suffix>
with templates selected to mislead without overt semantic overlap.

4. Experimental Evaluation

VLA-Fool is benchmarked on the LIBERO suite (spatial, object, goal, and long-horizon tasks). Victim models are OpenVLA (7B), fine-tuned on LIBERO, using 224×224 inference, bfloat16 precision, and FlashAttention-2 (LoRA optional). White-box attacks use gradients; black-box attacks rely on output actions/success status only.

Attack hyperparameters per modality include:

  • SGCG: 10 substitution budget, candidate pool size 50, TmaxT_\text{max} = 20,
  • Patch: size s=50s = 50 px, region Ω\Omega around arm or objects,
  • Noise: σGauss=30\sigma_\text{Gauss} = 30, $\rho_\text{S%%%%34%%%%P} = 0.02$, DP ϵ=0.02\epsilon=0.02.

Measured metrics:

  • Failure Rate (FR =1SR= 1 - \text{SR}),
  • L2L_2 action deviation,
  • Lmis\mathcal{L}_\text{mis} semantic misalignment.

Performance is summarized in the table below.

Attack Spatial Object Goal Long Avg
GCG (white) 73.8 80.0 88.1 75.0 79.2
SGCG-1 (ref) 50.0 83.3 88.1 75.0 74.1
SGCG-2 (attr) 33.3 83.3 54.8 54.2 56.4
SGCG-3 (scope) 40.5 43.3 36.7 50.0 39.9
SGCG-4 (neg) 36.7 46.7 45.2 75.0 52.3
Suffix-1 (bb) 69.1 53.3 88.1 75.0 71.3
Suffix-2 (bb) 69.1 76.7 100.0 83.3 82.3
Prefix (bb) 23.8 63.3 33.3 41.5 40.5
Patch-Object (wb) 64.0 66.8 77.8 94.6 75.4
Patch-Arm (wb) 100.0 100.0 100.0 100.0 100.0
Gaussian (bb) 21.4 86.7 19.1 66.7 48.5
S&P (bb) 76.2 96.7 83.3 83.3 84.9
Cross-Misalign 97.6 95.6 96.7 100.0 97.5

Key results:

  • Arm-mounted patches cause complete (100%) failure across all settings.
  • Cross-modal misalignment achieves \sim98% FR, indicating severe vulnerability when semantic grounding is attacked.
  • SGCG referential and attribute substitutions are highly effective; scope perturbations are less so.
  • Salt-and-pepper noise is markedly more disruptive than Gaussian corruption.

Qualitative assessment highlights dramatic deviations in task outcome due to minor adversarial perturbations, including misdirected trajectories and misidentification in object selection.

5. Implications for Embodied AI Safety and Robustness

The high susceptibility of current VLA models to semantically guided perturbations demonstrates critical gaps in embodied alignment and robustness. Even superficial textual or visual manipulations can subvert agent behavior in complex tasks.

Noted implications:

  • Adversarial training and cross-modal regularization are currently underexplored but necessary to counteract such vulnerabilities.
  • Robustness constraints (e.g., Lipschitz bounds) on encoders may help mitigate attack success.
  • The restriction to simulation (LIBERO) and a single model type limits empirical generality; expansion to hardware and diverse model architectures is needed.

A plausible implication is that even modest progress in cross-modal robustness, especially at the semantic feature level, could have disproportionate safety benefits in real-world deployments.

6. Future Directions and Limitations

Future research directions, as identified by VLA-Fool's authors, include:

  • Developing multimodal adversarial training regimes and regularization techniques that explicitly target cross-modal consistency.
  • Extending attack and defense evaluations to real-world physical settings for embodied agents.
  • Exploring a wider variety of environments and model architectures to better capture the generality and transferability of adversarial vulnerabilities.

Current limitations:

  • The evaluation is restricted to the LIBERO benchmark and one VLA architecture (OpenVLA).
  • Further work is necessary to support broader claims regarding transferability and the impact on commercial/real robotic deployments.

VLA-Fool offers the first systematic, multimodal adversarial benchmarking for embodied VLA systems, providing a foundation for both robustness research and next-generation safety evaluations in embodied artificial intelligence (Yan et al., 20 Nov 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Whiteboard

Topic to Video (Beta)

Follow Topic

Get notified by email when new papers are published related to VLA-Fool.