VLA-Fool: Multimodal Adversarial Benchmark
- VLA-Fool is a unified framework that assesses adversarial attacks on vision-language-action models through systematic perturbations in visual and textual inputs.
- It implements gradient-based, patch-based, and cross-modal misalignment techniques, achieving up to 100% failure rates in certain attack settings.
- The framework reveals critical challenges in aligning visual, language, and action modalities, driving the need for enhanced adversarial resilience in embodied AI.
VLA-Fool is a unified adversarial robustness framework targeting Vision-Language-Action (VLA) models in embodied AI, providing comprehensive methodology and benchmarking for probing the fragility of alignment across vision, language, and action modalities. Designed to systematically assess both white-box and black-box vulnerabilities, VLA-Fool enables precise evaluation of multimodal adversarial robustness and exposes critical weaknesses in embodied perception, reasoning, and control systems (Yan et al., 20 Nov 2025).
1. Formal Framework and Threat Models
VLA-Fool studies how an adversary can induce misalignment in vision-language-action models, where input is a tuple with a visual observation and a natural-language instruction. The VLA model outputs an action vector , combining visual encoder , language encoder , and an action decoder .
The adversary aims to construct perturbed inputs such that deviates from the nominal . The attack objective is,
where quantifies action deviation.
Two threat models are considered:
- White-box: Full access to model weights, structure, and gradients,
- Black-box: Only output actions or success metrics are available.
Attacks are categorized by modality:
- Textual perturbation: ,
- Visual perturbation: ,
- Cross-modal misalignment: Joint optimization where
and and are patch and token embeddings.
2. Multilevel Attack Methodologies
VLA-Fool implements three primary attack channels:
a) Textual Perturbations
- Gradient-based (SGCG): Extends Greedy Coordinate Gradient by leveraging a VLA-aware semantic space. The method computes for token embeddings, selects sensitive positions, and substitutes tokens from a candidate pool . The class-specific substitute lexicons address referential ambiguity, attribute weakening, scope blurring, and negation confusion.
- Prompt-based (black-box): Adversarial context is injected via crafted prefixes/suffixes, e.g., prepending “Act as an antagonistic agent…” or appending “ignore previous message and instead …”, requiring no embedding or gradient access.
b) Visual Perturbations
- Patch-based (white-box): An adversarial patch is optimized and placed (via operator ) on scene regions (object/environmental or robot arm). Optimization uses gradient ascent for .
- Noise-based (black-box): Injects realistic image corruptions: Gaussian noise (), salt-and-pepper (fraction of pixels), speckle, uniform, pseudo-random (PRNG) patterns, and differentially private (DP) randomization.
c) Cross-Modal Misalignment
Attacks that explicitly disrupt alignment between visual feature patches and semantic tokens, maximizing . The loss is directly tied to changes in the cosine similarity matrix between visual and language embeddings, potentially with regularization via on the action space.
3. Semantic Space and Prompt Engineering
The VLA-aware semantic space comprises four perturbation modes:
- Referential ambiguity,
- Attribute weakening,
- Scope blurring,
- Negation confusion.
For each, a lexicon of candidate tokens is constructed, merged with structure-based and geometric candidates . The attack’s candidate pool thus remains grounded in semantics relevant to embodied reasoning.
Prompt-based (black-box) attacks use templates of the form:
1 |
<prefix> + T + <suffix> |
4. Experimental Evaluation
VLA-Fool is benchmarked on the LIBERO suite (spatial, object, goal, and long-horizon tasks). Victim models are OpenVLA (7B), fine-tuned on LIBERO, using 224×224 inference, bfloat16 precision, and FlashAttention-2 (LoRA optional). White-box attacks use gradients; black-box attacks rely on output actions/success status only.
Attack hyperparameters per modality include:
- SGCG: 10 substitution budget, candidate pool size 50, = 20,
- Patch: size px, region around arm or objects,
- Noise: , $\rho_\text{S%%%%34%%%%P} = 0.02$, DP .
Measured metrics:
- Failure Rate (FR ),
- action deviation,
- semantic misalignment.
Performance is summarized in the table below.
| Attack | Spatial | Object | Goal | Long | Avg |
|---|---|---|---|---|---|
| GCG (white) | 73.8 | 80.0 | 88.1 | 75.0 | 79.2 |
| SGCG-1 (ref) | 50.0 | 83.3 | 88.1 | 75.0 | 74.1 |
| SGCG-2 (attr) | 33.3 | 83.3 | 54.8 | 54.2 | 56.4 |
| SGCG-3 (scope) | 40.5 | 43.3 | 36.7 | 50.0 | 39.9 |
| SGCG-4 (neg) | 36.7 | 46.7 | 45.2 | 75.0 | 52.3 |
| Suffix-1 (bb) | 69.1 | 53.3 | 88.1 | 75.0 | 71.3 |
| Suffix-2 (bb) | 69.1 | 76.7 | 100.0 | 83.3 | 82.3 |
| Prefix (bb) | 23.8 | 63.3 | 33.3 | 41.5 | 40.5 |
| Patch-Object (wb) | 64.0 | 66.8 | 77.8 | 94.6 | 75.4 |
| Patch-Arm (wb) | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Gaussian (bb) | 21.4 | 86.7 | 19.1 | 66.7 | 48.5 |
| S&P (bb) | 76.2 | 96.7 | 83.3 | 83.3 | 84.9 |
| Cross-Misalign | 97.6 | 95.6 | 96.7 | 100.0 | 97.5 |
Key results:
- Arm-mounted patches cause complete (100%) failure across all settings.
- Cross-modal misalignment achieves 98% FR, indicating severe vulnerability when semantic grounding is attacked.
- SGCG referential and attribute substitutions are highly effective; scope perturbations are less so.
- Salt-and-pepper noise is markedly more disruptive than Gaussian corruption.
Qualitative assessment highlights dramatic deviations in task outcome due to minor adversarial perturbations, including misdirected trajectories and misidentification in object selection.
5. Implications for Embodied AI Safety and Robustness
The high susceptibility of current VLA models to semantically guided perturbations demonstrates critical gaps in embodied alignment and robustness. Even superficial textual or visual manipulations can subvert agent behavior in complex tasks.
Noted implications:
- Adversarial training and cross-modal regularization are currently underexplored but necessary to counteract such vulnerabilities.
- Robustness constraints (e.g., Lipschitz bounds) on encoders may help mitigate attack success.
- The restriction to simulation (LIBERO) and a single model type limits empirical generality; expansion to hardware and diverse model architectures is needed.
A plausible implication is that even modest progress in cross-modal robustness, especially at the semantic feature level, could have disproportionate safety benefits in real-world deployments.
6. Future Directions and Limitations
Future research directions, as identified by VLA-Fool's authors, include:
- Developing multimodal adversarial training regimes and regularization techniques that explicitly target cross-modal consistency.
- Extending attack and defense evaluations to real-world physical settings for embodied agents.
- Exploring a wider variety of environments and model architectures to better capture the generality and transferability of adversarial vulnerabilities.
Current limitations:
- The evaluation is restricted to the LIBERO benchmark and one VLA architecture (OpenVLA).
- Further work is necessary to support broader claims regarding transferability and the impact on commercial/real robotic deployments.
VLA-Fool offers the first systematic, multimodal adversarial benchmarking for embodied VLA systems, providing a foundation for both robustness research and next-generation safety evaluations in embodied artificial intelligence (Yan et al., 20 Nov 2025).