- The paper introduces ADVLA, a feature-space adversarial attack framework that disrupts vision-language-action models by misaligning projected features.
- It employs three attention-guided strategies—AW, TKM, and TKL—to optimize patch-wise perturbations, ensuring minimal visual artifacts and high stealth.
- ADVLA achieves nearly 100% failure rates while perturbing fewer than 10% of image patches and running in approximately 0.06 seconds per image, highlighting significant safety risks.
Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models
Introduction and Motivation
Vision-Language-Action (VLA) models have become pivotal in embodied intelligence, enabling systems that align linguistic input with visual perception to drive motor actions in real time. While their deployment in real-world robotic scenarios such as manipulation, assisted assembly, and human-robot interaction has increased, these models' robustness to adversarial perturbations remains a crucial unsolved challenge with direct safety implications. Prior attacks on VLA architectures, typically pixel-level or patch-based manipulations, often incur high computational cost, introduce conspicuous visual artifacts, or show limited efficacy when operating in feature space. The paper introduces ADVLA, a framework engineered to conduct efficient, sparse, and visually inconspicuous adversarial attacks directly in the feature space aligned with downstream language and action primitives in modern VLA models (2511.21663).
Figure 1: ADVLA pipeline with main attack loop, attention-guided mask generation, and the three mask strategies for perturbation update and loss calculation.
ADVLA: Framework and Methodology
ADVLA is formulated as a gray-box, digital-domain adversarial attack targeting the visual encoder of VLA systems (evaluated specifically on OpenVLA). The threat model grants access to the visual encoder's weights, attention maps, and projected features, but explicitly excludes access to the downstream LLM and action-predictor parameters. The goal is to generate adversarial examples that minimize the similarity between clean and adversarial features in the projected (visual-to-text alignment) space, thereby disrupting action inference.
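The exact objective is not reproduced in this summary, but a plausible reading is a cosine-similarity loss between clean and adversarial projected features, minimized so that the adversarial features drift away from their clean counterparts. The sketch below is a minimal illustration under that assumption; `encoder` and `projector` are hypothetical stand-ins for OpenVLA's vision backbone and visual-to-text projection, not the actual API.

```python
import torch
import torch.nn.functional as F

def feature_misalignment_loss(encoder, projector, x_clean, x_adv):
    """Cosine similarity between clean and adversarial projected visual features.
    Minimizing this value pushes the adversarial features away from the clean
    ones in the visual-to-text alignment space (assumed objective)."""
    with torch.no_grad():
        f_clean = projector(encoder(x_clean))   # fixed reference features
    f_adv = projector(encoder(x_adv))           # gradients flow through this branch
    return F.cosine_similarity(f_adv, f_clean, dim=-1).mean()
```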
Core elements include:
- Projected Gradient Descent (PGD) in Feature Space: Perturbations are crafted iteratively on input images to induce maximal misalignment between the clean and adversarial projected visual representations, bounded by a strict ℓ∞ constraint (e.g., $4/255$ per pixel).
- Three Attention-Guided Strategies:
- Attention-Weighted Gradient (AW): Image gradients are modulated by the vision transformer's spatial attention map, focusing update energy on model-attended regions.
- Top-K Masked Gradient (TKM): Gradients are masked such that only the highest attention-weighted patches (e.g., top 10%) are eligible for perturbation update, achieving sparsity and imperceptibility.
- Top-K Loss (TKL): Loss is computed solely over the feature components of the top-K attention patches rather than globally, restricting the adversarial objective to the most sensitive zones.
Each strategy is modular and independently configurable, allowing targeted evaluation of attack sparsity, focus, and stealth; a minimal code sketch combining the PGD loop with these strategies follows below.
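The following is a minimal sketch of the attack loop under the stated budget (ℓ∞ bound of 4/255, six steps), reusing the misalignment loss sketched earlier as `loss_fn`. The attention source, patch size, step size, and top-K fraction are illustrative assumptions rather than the authors' implementation; the TKL variant is only indicated by a comment, since it modifies the loss rather than the gradient mask.

```python
import torch

def advla_pgd(x, encoder, projector, patch_attn, loss_fn,
              eps=4/255, alpha=1/255, steps=6, strategy="TKM", top_k=0.10, patch=14):
    """Attention-guided PGD in projected feature space (illustrative sketch).

    x          : clean image tensor in [0, 1], shape (C, H, W) or (B, C, H, W)
    patch_attn : per-patch attention weights, shape (H//patch * W//patch,)
    strategy   : 'AW'  -> per-pixel step size scaled by normalized attention,
                 'TKM' -> updates restricted to the top-K attention patches,
                 'TKL' -> loss restricted to top-K patches (handled inside loss_fn).
    """
    h, w = x.shape[-2] // patch, x.shape[-1] // patch
    grid = patch_attn.reshape(h, w)
    if strategy == "TKM":
        thresh = torch.quantile(patch_attn, 1.0 - top_k)
        grid = (grid >= thresh).float()                  # binary mask over top-K patches
    else:
        grid = grid / (grid.max() + 1e-12)               # normalize attention to [0, 1]
    # Upsample patch-level weights/mask to pixel resolution.
    weight = grid.repeat_interleave(patch, 0).repeat_interleave(patch, 1)

    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(encoder, projector, x, x_adv)     # lower = more misaligned
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            step = alpha * grad.sign()                   # signed descent step on the loss
            if strategy in ("AW", "TKM"):
                step = step * weight                     # focus/mask updates by attention
            x_adv = x_adv - step
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # project back into the ℓ∞ ball
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```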
Experimental Analysis
Benchmark and Evaluation Protocol
The approach is evaluated systematically on the LIBERO suite, which covers diverse physical manipulation tasks, with the OpenVLA model family as the adversarial target. The primary metric is Failure Rate (FR), with parameter sweeps over the perturbation amplitude ϵ and the number of PGD steps.
Quantitative Results
Under ϵ=4/255 and six PGD iterations, ADVLA and its variants achieve an average FR near 100% across all LIBERO suites, matching or surpassing the prior UADA baseline, which trains highly conspicuous patch attacks in a costly end-to-end manner. Critically, ADVLA-TKM perturbs fewer than 10% of image patches yet yields a nearly perfect attack rate. Performance remains strong at lower ϵ and fewer steps, and improves further with larger iteration budgets and relaxed norm bounds.
Efficiency and Practicality
The computational budget is substantially reduced: ADVLA completes an attack on a single image in roughly 0.06 seconds, orders of magnitude faster than UADA's multi-hour training requirement for a single patch. This efficiency stems from running a lightweight, few-step PGD attack in feature space, which removes the need for full-model backpropagation or training of explicit adversarial artifacts.
Figure 2: Comparison of image perturbations; (a) UADA introduces visible artifacts, (b) ADVLA global noise is faintly visible, (c) ADVLA-TKM shows almost invisible perturbations on high-attention patches, (d) underlying vision-module attention map.
Visual Inspection
Figure 2 demonstrates the comparative imperceptibility of ADVLA-TKM: while UADA's global adversarial patching is highly salient, ADVLA restricts perturbations to attention-weighted regions (typically corresponding to manipulators or salient scene objects), yielding adversarial examples that remain nearly undetectable even under visual amplification while still causing full system compromise.
Implications and Future Directions
Theoretical Implications
The work exposes an inherent vulnerability in the feature-projection pipelines used for multi-modal alignment. Attacks localized in the projected feature space can cascade through the action stack, disrupting inference while circumventing defenses that rely on input reconstruction or mask detection. The attention-guided sparsity principle further shows that the model's reliance on a small set of highly attended patches for alignment creates a natural attack surface.
Practical Impacts
Given the minimal perceptibility and rapid execution, ADVLA provides a realistic adversarial benchmark for VLA robustness evaluation. In industrial or medical robotics, such attacks could lead to misexecution without alerting human operators, even in the presence of visual anomaly detectors. As the field moves towards physical deployment, model owners must invest in defense strategies that go beyond input masking or simple adversarial training, potentially requiring robust cross-modal alignment regularization or online detection of feature-level inconsistencies.
Future Work
- Integrating physical-world testing to evaluate the transferability of the attack.
- Extending the framework to fully black-box settings or defense-conditional scenarios.
- Developing adaptive defenses grounded in projection-space consistency checks or anomaly detection in attention-weighted regions (one possible check is sketched below).
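The paper does not specify such a defense; the following is purely a hedged illustration of how a projection-space consistency check might be instantiated, borrowing from noise-based consistency defenses. The underlying assumption, which would need empirical validation, is that projected features of an attacked input are less stable under small random re-noising than those of a clean input. `encoder` and `projector` are the same hypothetical stand-ins used above.

```python
import torch
import torch.nn.functional as F

def projection_consistency_score(encoder, projector, x, sigma=0.01, n_samples=4):
    """Hypothetical detection signal: average cosine similarity between the
    projected features of the input and of lightly re-noised copies.
    A low score suggests the features are unusually sensitive to small input
    changes, which may indicate a feature-space attack (assumption, untested)."""
    with torch.no_grad():
        ref = projector(encoder(x))
        sims = []
        for _ in range(n_samples):
            x_noisy = (x + sigma * torch.randn_like(x)).clamp(0.0, 1.0)
            sims.append(F.cosine_similarity(projector(encoder(x_noisy)), ref, dim=-1).mean())
    return torch.stack(sims).mean()
```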
Conclusion
ADVLA establishes a new paradigm for fast, efficient, and stealthy adversarial attacks on VLA models, operating in the projected feature space with attention-guided sparsity. The approach yields virtually perfect attack rates under strict imperceptibility constraints, with negligible runtime overhead. These findings underscore the necessity of robust alignment strategies and adversarial-aware training for embodied multi-modal systems, as vulnerabilities at the vision-language fusion boundary can result in catastrophic functional degradation with minimal observable cues.