
Cross-Modal Transferable Adversarial Attacks

Updated 1 December 2025
  • Research in this area demonstrates that adversarial perturbations crafted in one modality can successfully mislead models in other modalities due to shared feature representations.
  • Representative methodologies such as I2V, Medusa, and SGA optimize perturbations via surrogate ensembles and cross-modal feature divergence to maximize attack transferability.
  • Rigorous experiments show attack success rates as high as 98%, highlighting vulnerabilities in multimodal systems and the urgent need for robust, modality-adaptive defenses.

Cross-modal transferable adversarial attacks are a family of attack techniques in which adversarial perturbations crafted in one input modality (e.g., images) remain effective at misleading machine learning models operating in a different modality (e.g., videos, vision-language systems, or multimodal LLMs). This cross-modal transferability raises substantial concerns about the security and robustness of modern deep neural networks deployed in real-world, heterogeneous environments. The defining characteristic of such attacks is the deliberate exploitation of overlapping or shared feature spaces between disparate modalities, targeting the architectural couplings that align latent representations across cross-modal pipelines.

1. Theoretical Foundations and Scope

Cross-modal transferable adversarial attacks fundamentally extend the classical notion of adversarial transferability—typically understood as the effectiveness of perturbations crafted for one model against another—into the multi-modality regime. The key insight is that state-of-the-art recognition, reasoning, and generative architectures for video, language, retrieval-augmented generation, or sensor-fusion tasks all contain layers or modules with feature distributions homologous to those in single-modality (often image-based) models. For video recognition, for instance, spatial backbones are often pre-trained on images and reused in video models, resulting in early-stage representations with near-identical filter responses (Wei et al., 2021). Likewise, vision-language pretraining models (VLPs) leverage shared embedding spaces (e.g., CLIP), aligning diverse modalities in highly structured joint manifolds (Lu et al., 2023, Fu et al., 16 Mar 2024).

This foundational overlap, often combined with explicit weight reuse or architectural analogies (e.g., transformer-based pooling over both frames and tokens), permits adversarial manipulations crafted for a source modality (e.g., images) or architecture (e.g., image-caption retrieval models) to “survive” the transition into a black-box target model in a different modality or downstream task (Wei et al., 2021, Huang et al., 2 Jan 2025, Gotin et al., 14 Jan 2025).

2. Algorithmic Formulations and Attack Methodologies

Central to cross-modal attacks are the loss design and perturbation propagation strategies that maximize transferability against black-box targets. Notable methodologies include:

A. Feature Divergence Attacks:

The I2V attack (Wei et al., 2021) generates image-level perturbations by minimizing cosine similarity between features extracted from clean and perturbed images under a white-box image model (e.g., ResNet-50, Inception-v3). This is formalized as:

$$L_{I2V}(x, \delta) = 1 - \frac{\langle \phi_I(x), \phi_I(x+\delta) \rangle}{\|\phi_I(x)\|_2 \cdot \|\phi_I(x+\delta)\|_2} + \lambda \|\delta\|_2^2$$

subject to $\|\delta\|_\infty \leq \epsilon$, where $\phi_I$ indicates feature extraction at a designated layer.
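
A minimal PyTorch sketch of this feature-divergence step is given below. The torchvision ResNet-50 surrogate, the choice of the last residual stage as $\phi_I$, and the step sizes are illustrative assumptions; the loop simply descends on the cosine similarity (with a small penalty on $\delta$) under the $L_\infty$ budget rather than reproducing the paper's exact optimizer.

```python
# Sketch of an I2V-style feature-divergence attack (assumed surrogate + hyperparameters).
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
model = resnet50(weights=ResNet50_Weights.DEFAULT).to(device).eval()
# phi_I: activations up to the last residual stage serve as the feature extractor.
feature_extractor = torch.nn.Sequential(*list(model.children())[:-2])

def i2v_style_attack(x, eps=8 / 255, alpha=2 / 255, steps=60, lam=1e-4):
    """PGD-style loop that drives down cos(phi_I(x), phi_I(x + delta))."""
    with torch.no_grad():
        clean_feat = feature_extractor(x).flatten(1)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        adv_feat = feature_extractor(x + delta).flatten(1)
        cos = F.cosine_similarity(clean_feat, adv_feat, dim=1).mean()
        loss = cos + lam * delta.pow(2).mean()   # minimize similarity, keep delta small
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descend on the similarity
            delta.clamp_(-eps, eps)              # enforce the L_inf budget
            delta.grad = None
    return (x + delta).detach()
```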

B. Surrogate Model Ensemble with Invariant Risk Minimization:

Medusa (Shang et al., 24 Nov 2025) expands this approach for medical retrieval-augmented generation, formulating the attack as minimization of a multi-positive InfoNCE loss across an ensemble of image-text CLIP-style models, augmented by an IRM penalty to enforce perturbation stability across surrogate models. The goal is to drive the adversarial image embedding close to attacker-chosen target text embeddings while distancing it from the true report.
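
A hedged sketch of such an ensemble objective is shown below. The encoder callables, embedding shapes, and weighting are assumptions, and a V-REx-style variance penalty across surrogates stands in for the IRM regularizer described above rather than reproducing the paper's exact formulation.

```python
# Sketch: multi-positive InfoNCE over a surrogate ensemble with a stability penalty.
import torch
import torch.nn.functional as F

def multi_positive_infonce(img_emb, pos_txt_emb, neg_txt_emb, tau=0.07):
    """-log( mass on attacker-chosen target texts / mass on targets + true report )."""
    img_emb = F.normalize(img_emb, dim=-1)      # (1, d) adversarial image embedding
    pos = F.normalize(pos_txt_emb, dim=-1)      # (P, d) attacker-chosen target texts
    neg = F.normalize(neg_txt_emb, dim=-1)      # (N, d) true report / distractors
    pos_logits = img_emb @ pos.t() / tau        # (1, P)
    all_logits = torch.cat([pos_logits, img_emb @ neg.t() / tau], dim=-1)
    return -(torch.logsumexp(pos_logits, -1) - torch.logsumexp(all_logits, -1)).mean()

def ensemble_attack_loss(image_encoders, x_adv, pos_txt_embs, neg_txt_embs, irm_weight=1.0):
    """Loss to minimize over the adversarial image x_adv across all surrogates."""
    risks = torch.stack([
        multi_positive_infonce(enc(x_adv), pos, neg)
        for enc, pos, neg in zip(image_encoders, pos_txt_embs, neg_txt_embs)
    ])
    # Stability across surrogates ("environments"): a variance penalty is used here
    # as a stand-in for the IRM term, discouraging over-fitting to one surrogate.
    return risks.mean() + irm_weight * risks.var()
```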

C. Cross-modality Gradient and Interaction Guidance:

SGA (Lu et al., 2023) and CMI-Attack (Fu et al., 16 Mar 2024) introduce multi-part or set-level optimization, alternating between adversarial perturbation of one modality and gradient-based updates on the counterpart. SGA, for example, perturbs an image by maximizing divergence from multiple adversarial captions (drawn from a caption pool), then updates caption perturbations in response to the current adversarial image.
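
The image-side update of such an alternating scheme can be sketched with a public CLIP surrogate as below; the model choice, step sizes, optimization in CLIP's normalized pixel space, and the omission of the caption-side token updates are all simplifying assumptions.

```python
# Sketch of the image-update step in an SGA-style set-level attack on a CLIP surrogate.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_step(pixel_values, caption_pool, eps=8 / 255, alpha=2 / 255, steps=10):
    """Push the image embedding away from the whole set of (adversarial) captions."""
    txt = processor(text=caption_pool, return_tensors="pt", padding=True)
    with torch.no_grad():
        txt_emb = F.normalize(model.get_text_features(**txt), dim=-1)  # (P, d)
    delta = torch.zeros_like(pixel_values, requires_grad=True)
    for _ in range(steps):
        img_emb = F.normalize(
            model.get_image_features(pixel_values=pixel_values + delta), dim=-1)
        # Set-level guidance: minimize similarity to every caption in the pool.
        loss = (img_emb @ txt_emb.t()).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad = None
    # The alternating text step (token substitutions on the caption pool against
    # pixel_values + delta) would follow here before the pool is re-used.
    return (pixel_values + delta).detach()
```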

D. Propagative and Temporal Consistency Approaches:

Video-centric cross-modal attacks such as I2V-MLLM (Huang et al., 2 Jan 2025) and IC2VQA (Gotin et al., 14 Jan 2025) address the need for temporal consistency and robustness to frame sampling. Perturbations are first optimized with reference to a surrogate image- or frame-based model, then propagated or temporally smoothed across video frames to maximize the likelihood of misleading black-box video or video-LLMs.
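
One way to realize this propagation is sketched below, assuming raw-pixel frames in [0, 1] and a simple moving-average smoother; the actual papers use more elaborate propagation and sampling-robustness schemes.

```python
# Sketch: per-frame perturbations from an image surrogate, smoothed across time.
import torch

def temporally_smooth(deltas, window=3):
    """deltas: (T, C, H, W) per-frame perturbations; average over a temporal window."""
    T = deltas.shape[0]
    smoothed = torch.empty_like(deltas)
    for t in range(T):
        lo, hi = max(0, t - window // 2), min(T, t + window // 2 + 1)
        smoothed[t] = deltas[lo:hi].mean(dim=0)
    return smoothed

def attack_video(frames, per_frame_attack, eps=8 / 255, window=3):
    """frames: (T, C, H, W); per_frame_attack maps one frame to its adversarial version
    (e.g., the i2v_style_attack sketch above)."""
    deltas = torch.stack([per_frame_attack(f.unsqueeze(0)).squeeze(0) - f
                          for f in frames])
    deltas = temporally_smooth(deltas, window).clamp(-eps, eps)
    return (frames + deltas).clamp(0, 1)
```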

E. Semantic Diversification and Physical Attack Variants:

In multimodal LLMs, techniques such as typographical augmentation (TSTA) overlay random semantic content during the attack optimization, forcing wider coverage of the model’s semantic attention and improving transfer to diverse black-box LLM architectures (Cheng et al., 30 May 2024). In the physical world, contour-based patch optimization for both visible and infrared domains achieves universal deception across sensor types (Wei et al., 2023).
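
A hedged illustration of the typographic-augmentation idea is given below, with a placeholder word list and PIL's default font standing in for the actual TSTA recipe; at each optimization step a freshly augmented copy would be fed to the surrogate so the perturbation does not latch onto a single semantic region.

```python
# Sketch: overlay random words on a copy of the image (placeholder vocabulary/layout).
import random
from PIL import Image, ImageDraw

WORDS = ["apple", "river", "engine", "cloud", "ladder"]  # illustrative word list

def typographic_augment(img: Image.Image, n_words=3) -> Image.Image:
    """Return an RGB copy of img with n_words random words drawn at random positions."""
    out = img.convert("RGB")
    draw = ImageDraw.Draw(out)
    w, h = out.size
    for _ in range(n_words):
        word = random.choice(WORDS)
        xy = (random.randint(0, max(1, w - 60)), random.randint(0, max(1, h - 20)))
        draw.text(xy, word, fill=(255, 255, 255))  # PIL's default bitmap font
    return out
```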

3. Experimental Insights and Taxonomy of Targets

Experimental paradigms span a wide spectrum of domains and target architectures. The principal tasks and results include:

| Attack | Source Modality | Target Modality / Task | Key Target Models | Peak Transfer ASR |
|---|---|---|---|---|
| I2V (Wei et al., 2021) | Image | Video classification | C3D, I3D, TSN, SlowFast | 62–79% |
| Medusa (Shang et al., 24 Nov 2025) | Image | Medical retrieval-aug. generation | PMC-CLIP, MONET | 90–98% (PMC-CLIP) |
| SGA (Lu et al., 2023) | Image + Text | Vision–language retrieval | ALBEF→TCL | +30 pp over baseline |
| CMI-Attack (Fu et al., 16 Mar 2024) | Image + Text | VLPs and captioning | ALBEF→TCL/CLIP | +8–16 pp over SGA |
| VQAttack (Yin et al., 16 Feb 2024) | Image + Text | VQA (visual question answering) | VQA v2, TextVQA | 34–79% (vs. ~30% SOTA) |
| IC2VQA (Gotin et al., 14 Jan 2025) | Image (frames) | Video quality metrics (VQA) | VSFA, MDTVSFA | PLCC/SRCC drop |
| X-Transfer (Huang et al., 8 May 2025) | Image | CLIP/VLMs (unified) | 64 CLIP/VLM variants | 69–75% |
| TSTA (Cheng et al., 30 May 2024) | Image | Multimodal LLMs (harmful word insert) | MiniGPT-4, LLaVA | 3–5× ASR over baseline |
| CrossFire (Dou et al., 10 Sep 2024) | Image/Audio | Multimodal retrieval/generation | ImageBind, PandaGPT | 76–98% (image), 86–94% (audio) |

Attack Success Rate (ASR) in this context measures the fraction of inputs for which the adversarial example forces an output change in the black-box model (e.g., label flip, targeted generation, retrieval rank drop).
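
Under this definition, ASR can be computed in a few lines; the variable names below are illustrative.

```python
def attack_success_rate(clean_outputs, adv_outputs):
    """Fraction of inputs whose black-box output changes under the adversarial example."""
    changed = sum(c != a for c, a in zip(clean_outputs, adv_outputs))
    return changed / max(len(clean_outputs), 1)
```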

4. Key Empirical Findings and Ablations

Substantial empirical analyses identify the principal factors underlying effective cross-modal transfer:

  • Feature Homology: Transferability strongly depends on the degree of feature map alignment—adversarial perturbations optimized to disrupt shared low-level representations generalize well; perturbations crafted on deeper, modality-specific layers are less effective (Wei et al., 2021, Gotin et al., 14 Jan 2025).
  • Ensemble and Multi-Positive Guidance: Surrogate ensembles spanning diverse pre-training domains and architectures boost attack robustness by covering model-specific embedding variance (Shang et al., 24 Nov 2025).
  • Many-to-Many and Set-Level Guidance: Attacks that operate jointly over multiple image/text augmentations or layouts outperform those relying on single-pair interactions (Lu et al., 2023, Fu et al., 16 Mar 2024).
  • Semantic Diversification: For multimodal LLMs, the injection of random semantic elements (e.g., random word “typography,” alternative prompts) significantly widens the space of transferable perturbations, as illustrated by the drop in CLIPScore and increased ASR against previously unseen prompts (Cheng et al., 30 May 2024).
  • Optimization Strategies: Surrogate scaling (multi-armed bandit selection over large surrogate sets) allows efficient computation of universal adversarial perturbations with “super transferability” across data, model, and task (Huang et al., 8 May 2025); a bandit-style selection loop is sketched after this list.
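
The sketch below illustrates such bandit-style surrogate scaling; the epsilon-greedy rule, the use of the attack loss as reward, and the perturbation shape are illustrative choices rather than X-Transfer's exact algorithm.

```python
# Sketch: universal perturbation trained by picking one surrogate per step with a bandit.
import random
from itertools import cycle
import torch

def bandit_uap(surrogates, attack_loss, data_loader, eps=8 / 255, alpha=1 / 255,
               steps=1000, explore=0.1):
    counts = [0] * len(surrogates)
    values = [0.0] * len(surrogates)           # running mean reward per surrogate
    delta = torch.zeros(3, 224, 224)           # universal perturbation (assumed input size)
    batches = cycle(data_loader)
    for _ in range(steps):
        if random.random() < explore:
            k = random.randrange(len(surrogates))                       # explore
        else:
            k = max(range(len(surrogates)), key=lambda i: values[i])    # exploit
        x, _ = next(batches)
        d = delta.clone().requires_grad_(True)
        loss = attack_loss(surrogates[k], x + d)   # e.g. a feature-divergence objective
        loss.backward()
        with torch.no_grad():
            delta = (delta + alpha * d.grad.sign()).clamp(-eps, eps)    # ascend, keep budget
        counts[k] += 1
        values[k] += (loss.item() - values[k]) / counts[k]              # reward = attack loss
    return delta
```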

Ablation studies confirm that removal of key components, such as IRM regularization, prompt diversity, or cross-modal guidance, results in 5–20% reductions in ASR, highlighting their complementarity and necessity.

5. Defense Mechanisms and Robustness Evaluations

Evaluated defenses target both the input and feature levels. However, no current defense achieves comprehensive mitigation of cross-modal transferability; universal and semantic-level attacks remain robust to most remediation techniques.

6. Implications, Limitations, and Open Challenges

Cross-modal transferable adversarial attacks expose critical vulnerabilities in architectures that leverage shared or aligned representations across vision, language, and sensor modalities. The transferability phenomenon extends the black-box threat surface: adversaries need access only to white-box proxies in one modality to compromise a range of black-box systems in another, including medical RAG systems, multimodal LLMs, and physical sensor networks. Successful attacks are feasible with imperceptible perturbations or physically plausible patches, and are further strengthened by ensemble, multi-positive, and semantic-diversification techniques.

Key limitations include computational overhead (multi-model, multi-modal backpropagation), the need for noise-robust optimization in the presence of quantization or physical-world bottlenecks, and occasionally reduced efficacy under extreme architectural mismatches (e.g., very different input preprocessing or feature extractors).

Open research directions encompass: robust multimodal adversarial training, certified cross-modal robustness, deep investigation of feature manifold geometry across modalities, and principled detection or remediation of semantic-diversifying perturbations. The persistent high ASR values in current benchmarks underscore an urgent need for integrative, modality-adaptive defense strategies (Shang et al., 24 Nov 2025, Huang et al., 8 May 2025, Dou et al., 10 Sep 2024).
