
Cross-Modal Transferable Adversarial Attacks

Updated 1 December 2025
  • Research in this area demonstrates that adversarial perturbations crafted in one modality can successfully mislead models in other modalities due to shared feature representations.
  • Representative methodologies such as I2V, Medusa, and SGA optimize perturbations via surrogate ensembles and cross-modal feature divergence to maximize attack transferability.
  • Rigorous experiments show attack success rates as high as 98%, highlighting vulnerabilities in multimodal systems and the urgent need for robust, modality-adaptive defenses.

Cross-modal transferable adversarial attacks are a family of attack techniques in which adversarial perturbations crafted in one input modality (e.g., images) remain effective at misleading machine learning models operating in a different modality (e.g., videos, vision-language systems, or multimodal LLMs). This cross-modal transferability raises substantial concerns about the security and robustness of modern deep neural networks deployed in real-world, heterogeneous environments. The defining characteristic of such attacks is the deliberate exploitation of overlapping or shared feature spaces between disparate modalities, targeting the architectural couplings that align latent representations across cross-modal pipelines.

1. Theoretical Foundations and Scope

Cross-modal transferable adversarial attacks fundamentally extend the classical notion of adversarial transferability—typically understood as the effectiveness of perturbations crafted for one model against another—into the multi-modality regime. The key insight is that state-of-the-art recognition, reasoning, and generative architectures for video, language, retrieval-augmented generation, or sensor-fusion tasks all contain layers or modules with feature distributions homologous to those in single-modality (often image-based) models. For video recognition, for instance, spatial backbones are often pre-trained on images and reused in video models, resulting in early-stage representations with near-identical filter responses (Wei et al., 2021). Likewise, vision-language pretraining models (VLPs) leverage shared embedding spaces (e.g., CLIP), aligning diverse modalities in highly structured joint manifolds (Lu et al., 2023, Fu et al., 16 Mar 2024).

This foundational overlap, often combined with explicit weight reuse or architectural analogies (e.g., transformer-based pooling over both frames and tokens), permits adversarial manipulations crafted for a source modality (e.g., images) or architecture (e.g., image-caption retrieval models) to “survive” the transition into a black-box target model in a different modality or downstream task (Wei et al., 2021, Huang et al., 2 Jan 2025, Gotin et al., 14 Jan 2025).

2. Algorithmic Formulations and Attack Methodologies

Central to cross-modal attacks are the loss design and perturbation propagation strategies that maximize transferability against black-box targets. Notable methodologies include:

A. Feature Divergence Attacks:

The I2V attack (Wei et al., 2021) generates image-level perturbations by minimizing cosine similarity between features extracted from clean and perturbed images under a white-box image model (e.g., ResNet-50, Inception-v3). This is formalized as:

$$L_{I2V}(x, \delta) = 1 - \frac{\langle \phi_I(x), \phi_I(x+\delta) \rangle}{\|\phi_I(x)\|_2 \cdot \|\phi_I(x+\delta)\|_2} + \lambda \|\delta\|_2^2$$

subject to $\|\delta\|_\infty \leq \epsilon$, where $\phi_I$ indicates feature extraction at a designated layer.
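
A minimal PyTorch sketch of this feature-divergence step is given below. The torchvision ResNet-50 surrogate, the choice of the last residual stage as $\phi_I$, and the step sizes are illustrative assumptions; the loop simply descends on the cosine similarity (with a small penalty on $\delta$) under the $L_\infty$ budget rather than reproducing the paper's exact optimizer.

```python
# Sketch of an I2V-style feature-divergence attack (assumed surrogate + hyperparameters).
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
model = resnet50(weights=ResNet50_Weights.DEFAULT).to(device).eval()
# phi_I: activations up to the last residual stage serve as the feature extractor.
feature_extractor = torch.nn.Sequential(*list(model.children())[:-2])

def i2v_style_attack(x, eps=8 / 255, alpha=2 / 255, steps=60, lam=1e-4):
    """PGD-style loop that drives down cos(phi_I(x), phi_I(x + delta))."""
    with torch.no_grad():
        clean_feat = feature_extractor(x).flatten(1)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        adv_feat = feature_extractor(x + delta).flatten(1)
        cos = F.cosine_similarity(clean_feat, adv_feat, dim=1).mean()
        loss = cos + lam * delta.pow(2).mean()   # minimize similarity, keep delta small
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descend on the similarity
            delta.clamp_(-eps, eps)              # enforce the L_inf budget
            delta.grad = None
    return (x + delta).detach()
```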

B. Surrogate Model Ensemble with Invariant Risk Minimization:

Medusa (Shang et al., 24 Nov 2025) expands this approach for medical retrieval-augmented generation, formulating the attack as minimization of a multi-positive InfoNCE loss across an ensemble of image-text CLIP-style models, augmented by an IRM penalty to enforce perturbation stability across surrogate models. The goal is to drive the adversarial image embedding close to attacker-chosen target text embeddings while distancing it from the true report.
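
A hedged sketch of such an ensemble objective is shown below. The encoder callables, embedding shapes, and weighting are assumptions, and a V-REx-style variance penalty across surrogates stands in for the IRM regularizer described above rather than reproducing the paper's exact formulation.

```python
# Sketch: multi-positive InfoNCE over a surrogate ensemble with a stability penalty.
import torch
import torch.nn.functional as F

def multi_positive_infonce(img_emb, pos_txt_emb, neg_txt_emb, tau=0.07):
    """-log( mass on attacker-chosen target texts / mass on targets + true report )."""
    img_emb = F.normalize(img_emb, dim=-1)      # (1, d) adversarial image embedding
    pos = F.normalize(pos_txt_emb, dim=-1)      # (P, d) attacker-chosen target texts
    neg = F.normalize(neg_txt_emb, dim=-1)      # (N, d) true report / distractors
    pos_logits = img_emb @ pos.t() / tau        # (1, P)
    all_logits = torch.cat([pos_logits, img_emb @ neg.t() / tau], dim=-1)
    return -(torch.logsumexp(pos_logits, -1) - torch.logsumexp(all_logits, -1)).mean()

def ensemble_attack_loss(image_encoders, x_adv, pos_txt_embs, neg_txt_embs, irm_weight=1.0):
    """Loss to minimize over the adversarial image x_adv across all surrogates."""
    risks = torch.stack([
        multi_positive_infonce(enc(x_adv), pos, neg)
        for enc, pos, neg in zip(image_encoders, pos_txt_embs, neg_txt_embs)
    ])
    # Stability across surrogates ("environments"): a variance penalty is used here
    # as a stand-in for the IRM term, discouraging over-fitting to one surrogate.
    return risks.mean() + irm_weight * risks.var()
```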

C. Cross-modality Gradient and Interaction Guidance:

SGA (Lu et al., 2023) and CMI-Attack (Fu et al., 16 Mar 2024) introduce multi-part or set-level optimization, alternating between adversarial perturbation of one modality and gradient-based updates on the counterpart. SGA, for example, perturbs an image by maximizing divergence from multiple adversarial captions (drawn from a caption pool), then updates caption perturbations in response to the current adversarial image.
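
The image-side update of such an alternating scheme can be sketched with a public CLIP surrogate as below; the model choice, step sizes, optimization in CLIP's normalized pixel space, and the omission of the caption-side token updates are all simplifying assumptions.

```python
# Sketch of the image-update step in an SGA-style set-level attack on a CLIP surrogate.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_step(pixel_values, caption_pool, eps=8 / 255, alpha=2 / 255, steps=10):
    """Push the image embedding away from the whole set of (adversarial) captions."""
    txt = processor(text=caption_pool, return_tensors="pt", padding=True)
    with torch.no_grad():
        txt_emb = F.normalize(model.get_text_features(**txt), dim=-1)  # (P, d)
    delta = torch.zeros_like(pixel_values, requires_grad=True)
    for _ in range(steps):
        img_emb = F.normalize(
            model.get_image_features(pixel_values=pixel_values + delta), dim=-1)
        # Set-level guidance: minimize similarity to every caption in the pool.
        loss = (img_emb @ txt_emb.t()).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad = None
    # The alternating text step (token substitutions on the caption pool against
    # pixel_values + delta) would follow here before the pool is re-used.
    return (pixel_values + delta).detach()
```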

D. Propagative and Temporal Consistency Approaches:

Video-centric cross-modal attacks such as I2V-MLLM (Huang et al., 2 Jan 2025) and IC2VQA (Gotin et al., 14 Jan 2025) address the need for temporal consistency and robustness to frame sampling. Perturbations are first optimized with reference to a surrogate image- or frame-based model, then propagated or temporally smoothed across video frames to maximize the likelihood of misleading black-box video or video-LLMs.
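
One way to realize this propagation is sketched below, assuming raw-pixel frames in [0, 1] and a simple moving-average smoother; the actual papers use more elaborate propagation and sampling-robustness schemes.

```python
# Sketch: per-frame perturbations from an image surrogate, smoothed across time.
import torch

def temporally_smooth(deltas, window=3):
    """deltas: (T, C, H, W) per-frame perturbations; average over a temporal window."""
    T = deltas.shape[0]
    smoothed = torch.empty_like(deltas)
    for t in range(T):
        lo, hi = max(0, t - window // 2), min(T, t + window // 2 + 1)
        smoothed[t] = deltas[lo:hi].mean(dim=0)
    return smoothed

def attack_video(frames, per_frame_attack, eps=8 / 255, window=3):
    """frames: (T, C, H, W); per_frame_attack maps one frame to its adversarial version
    (e.g., the i2v_style_attack sketch above)."""
    deltas = torch.stack([per_frame_attack(f.unsqueeze(0)).squeeze(0) - f
                          for f in frames])
    deltas = temporally_smooth(deltas, window).clamp(-eps, eps)
    return (frames + deltas).clamp(0, 1)
```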

E. Semantic Diversification and Physical Attack Variants:

In multimodal LLMs, techniques such as typographical augmentation (TSTA) overlay random semantic content during the attack optimization, forcing wider coverage of the model’s semantic attention and improving transfer to diverse black-box LLM architectures (Cheng et al., 30 May 2024). In the physical world, contour-based patch optimization for both visible and infrared domains achieves universal deception across sensor types (Wei et al., 2023).
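
A hedged illustration of the typographic-augmentation idea is given below, with a placeholder word list and PIL's default font standing in for the actual TSTA recipe; at each optimization step a freshly augmented copy would be fed to the surrogate so the perturbation does not latch onto a single semantic region.

```python
# Sketch: overlay random words on a copy of the image (placeholder vocabulary/layout).
import random
from PIL import Image, ImageDraw

WORDS = ["apple", "river", "engine", "cloud", "ladder"]  # illustrative word list

def typographic_augment(img: Image.Image, n_words=3) -> Image.Image:
    """Return an RGB copy of img with n_words random words drawn at random positions."""
    out = img.convert("RGB")
    draw = ImageDraw.Draw(out)
    w, h = out.size
    for _ in range(n_words):
        word = random.choice(WORDS)
        xy = (random.randint(0, max(1, w - 60)), random.randint(0, max(1, h - 20)))
        draw.text(xy, word, fill=(255, 255, 255))  # PIL's default bitmap font
    return out
```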

3. Experimental Insights and Taxonomy of Targets

Experimental paradigms span a wide spectrum of domains and target architectures. The principal tasks and results include:

| Attack | Source Modality | Target Modality / Task | Key Target Models | Peak Transfer ASR |
|---|---|---|---|---|
| I2V (Wei et al., 2021) | Image | Video classification | C3D, I3D, TSN, SlowFast | 62–79% |
| Medusa (Shang et al., 24 Nov 2025) | Image | Medical retrieval-aug. generation | PMC-CLIP, MONET | 90–98% (PMC-CLIP) |
| SGA (Lu et al., 2023) | Image + Text | Vision–language retrieval | ALBEF→TCL | +30 pp over baseline |
| CMI-Attack (Fu et al., 16 Mar 2024) | Image + Text | VLPs and captioning | ALBEF→TCL/CLIP | +8–16 pp over SGA |
| VQAttack (Yin et al., 16 Feb 2024) | Image + Text | VQA (visual question answering) | VQA v2, TextVQA | 34–79% (vs. ~30% SOTA) |
| IC2VQA (Gotin et al., 14 Jan 2025) | Image (frames) | Video quality metrics (VQA) | VSFA, MDTVSFA | PLCC/SRCC drop |
| X-Transfer (Huang et al., 8 May 2025) | Image | CLIP/VLMs (unified) | 64 CLIP/VLM variants | 69–75% |
| TSTA (Cheng et al., 30 May 2024) | Image | Multimodal LLMs (harmful word insert) | MiniGPT-4, LLaVA | 3–5× ASR over baseline |
| CrossFire (Dou et al., 10 Sep 2024) | Image/Audio | Multimodal retrieval/generation | ImageBind, PandaGPT | 76–98% (image), 86–94% (audio) |

Attack Success Rate (ASR) in this context measures the fraction of inputs for which the adversarial example forces an output change in the black-box model (e.g., label flip, targeted generation, retrieval rank drop).
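
Under this definition, ASR can be computed in a few lines; the variable names below are illustrative.

```python
def attack_success_rate(clean_outputs, adv_outputs):
    """Fraction of inputs whose black-box output changes under the adversarial example."""
    changed = sum(c != a for c, a in zip(clean_outputs, adv_outputs))
    return changed / max(len(clean_outputs), 1)
```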

4. Key Empirical Findings and Ablations

Substantial empirical analyses identify the principal factors underlying effective cross-modal transfer:

  • Feature Homology: Transferability strongly depends on the degree of feature map alignment—adversarial perturbations optimized to disrupt shared low-level representations generalize well; perturbations crafted on deeper, modality-specific layers are less effective (Wei et al., 2021, Gotin et al., 14 Jan 2025).
  • Ensemble and Multi-Positive Guidance: Surrogate ensembles spanning diverse pre-training domains and architectures boost attack robustness by covering model-specific embedding variance (Shang et al., 24 Nov 2025).
  • Many-to-Many and Set-Level Guidance: Attacks that operate jointly over multiple image/text augmentations or layouts outperform those relying on single-pair interactions (Lu et al., 2023, Fu et al., 16 Mar 2024).
  • Semantic Diversification: For multimodal LLMs, the injection of random semantic elements (e.g., random word “typography,” alternative prompts) significantly widens the space of transferable perturbations, as illustrated by the drop in CLIPScore and increased ASR against previously unseen prompts (Cheng et al., 30 May 2024).
  • Optimization Strategies: Surrogate scaling (multi-armed bandit selection over large surrogate sets) allows efficient computation of universal adversarial perturbations with “super transferability” across data, model, and task (Huang et al., 8 May 2025); a bandit-style selection loop is sketched after this list.
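
The sketch below illustrates such bandit-style surrogate scaling; the epsilon-greedy rule, the use of the attack loss as reward, and the perturbation shape are illustrative choices rather than X-Transfer's exact algorithm.

```python
# Sketch: universal perturbation trained by picking one surrogate per step with a bandit.
import random
from itertools import cycle
import torch

def bandit_uap(surrogates, attack_loss, data_loader, eps=8 / 255, alpha=1 / 255,
               steps=1000, explore=0.1):
    counts = [0] * len(surrogates)
    values = [0.0] * len(surrogates)           # running mean reward per surrogate
    delta = torch.zeros(3, 224, 224)           # universal perturbation (assumed input size)
    batches = cycle(data_loader)
    for _ in range(steps):
        if random.random() < explore:
            k = random.randrange(len(surrogates))                       # explore
        else:
            k = max(range(len(surrogates)), key=lambda i: values[i])    # exploit
        x, _ = next(batches)
        d = delta.clone().requires_grad_(True)
        loss = attack_loss(surrogates[k], x + d)   # e.g. a feature-divergence objective
        loss.backward()
        with torch.no_grad():
            delta = (delta + alpha * d.grad.sign()).clamp(-eps, eps)    # ascend, keep budget
        counts[k] += 1
        values[k] += (loss.item() - values[k]) / counts[k]              # reward = attack loss
    return delta
```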

Ablation studies confirm that removal of key components, such as IRM regularization, prompt diversity, or cross-modal guidance, results in 5–20% reductions in ASR, highlighting their complementarity and necessity.

5. Defense Mechanisms and Robustness Evaluations

Evaluated defenses target both the input and feature levels. However, no current defense achieves comprehensive mitigation of cross-modal transferability; universal and semantic-level attacks remain robust to most remediation techniques.

6. Implications, Limitations, and Open Challenges

Cross-modal transferable adversarial attacks expose critical vulnerabilities in architectures that leverage shared or aligned representations across vision, language, and sensor modalities. The transferability phenomenon extends the black-box threat surface: adversaries need access only to white-box proxies in one modality to compromise a range of black-box systems in another, including medical RAG systems, multimodal LLMs, and physical sensor networks. Successful attacks are feasible with imperceptible perturbations or physically plausible patches, and are further strengthened by ensemble, multi-positive, and semantic-diversification techniques.

Key limitations include computational overhead (multi-model, multi-modal backpropagation), the need for noise-robust optimization in the presence of quantization or physical-world bottlenecks, and occasionally reduced efficacy under extreme architectural mismatches (e.g., very different input preprocessing or feature extractors).

Open research directions encompass: robust multimodal adversarial training, certified cross-modal robustness, deep investigation of feature manifold geometry across modalities, and principled detection or remediation of semantic-diversifying perturbations. The persistent high ASR values in current benchmarks underscore an urgent need for integrative, modality-adaptive defense strategies (Shang et al., 24 Nov 2025, Huang et al., 8 May 2025, Dou et al., 10 Sep 2024).
