
Compositional Attacks on Vision-Language Models

Updated 27 October 2025
  • The paper reveals that compositional adversarial attacks effectively manipulate both image and text modalities to disrupt cross-modal reasoning in VLMs, achieving over 90% attack success rate.
  • It details a gradient-based methodology that perturbs visual features through PGD-like updates and textual embeddings via iterative beam search refinements.
  • The study highlights vulnerabilities in attention, localization, and compositional structures, underscoring the need for integrated defense strategies in VLMs.

Compositional adversarial attacks on Vision-Language Models (VLMs) refer to methods that craft adversarial examples by simultaneously perturbing multiple modalities—such as the image and its associated text—or by targeting the internal mechanisms that integrate vision and language, with the goal of breaking the compositional reasoning and alignment capacities of the model. These attacks expose unique vulnerabilities in VLM architectures, including their attention, localization, and multimodal fusion modules, by misaligning visual and linguistic features or disrupting their interactions. Recent research demonstrates the feasibility and high effectiveness of such attacks, raising fundamental challenges for the robustness, safety, and deployment of VLMs in real-world scenarios.

1. Attack Methodologies: Compositional Perturbations Across Modalities

Compositional adversarial attacks on VLMs manipulate both image and text inputs to jointly deceive the model’s cross-modal reasoning. The primary methodology employs gradient-based optimization for both modalities. For the visual pathway, an image $I$ is iteratively perturbed to $I_{\text{adv}}$ within an $\ell_p$-ball constraint using a PGD-like update:

$$I_{\text{adv}}^{(t+1)} = \operatorname{Clip}_{I,\epsilon}\Bigl\{ I_{\text{adv}}^{(t)} + \alpha \cdot \operatorname{sign}\bigl(\nabla_I L(I_{\text{adv}}^{(t)}, y)\bigr) \Bigr\}$$

where $L$ is the model loss, $\alpha$ is the step size, and $\epsilon$ bounds the perturbation strength (Xu et al., 2017).
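The update above can be sketched in a few lines of PyTorch. This is a minimal illustration assuming a white-box setting and a callable `model_loss(image, target)` that returns the scalar loss $L$; both that function and the default hyperparameters are illustrative, not taken from the paper.

```python
import torch

def pgd_image_attack(model_loss, image, target, epsilon=8/255, alpha=2/255, steps=10):
    """PGD-like l_inf attack on the image branch (illustrative sketch).

    model_loss(image, target) is assumed to return the scalar loss L that the
    attacker wants to increase; it is a placeholder, not an API from the paper.
    """
    image_adv = image.clone().detach()
    for _ in range(steps):
        image_adv.requires_grad_(True)
        loss = model_loss(image_adv, target)
        grad, = torch.autograd.grad(loss, image_adv)
        with torch.no_grad():
            # Ascend the loss, then project back into the epsilon-ball around the clean image.
            image_adv = image_adv + alpha * grad.sign()
            image_adv = torch.min(torch.max(image_adv, image - epsilon), image + epsilon)
            image_adv = image_adv.clamp(0.0, 1.0)  # keep valid pixel range
        image_adv = image_adv.detach()
    return image_adv
```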

For the language modality, textual adversarial examples are crafted by (i) perturbing the continuous input embedding $T$ to obtain $T_{\text{adv}} = T + \delta$ with $\|\delta\|_p \leq \epsilon'$, and (ii) discretizing the perturbed embedding back into a plausible text sequence using an iterative refinement (e.g., beam search combined with gradient signals).
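A hedged sketch of this two-step text attack follows: the continuous embedding is perturbed under an $\ell_\infty$ bound, then projected back onto discrete tokens. A simple nearest-neighbor projection against the embedding table stands in for the beam-search refinement described above; `loss_fn` and the tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def perturb_and_discretize(text_embeds, embed_matrix, loss_fn,
                           epsilon=0.1, alpha=0.02, steps=5):
    """Perturb continuous token embeddings, then map back to discrete tokens.

    text_embeds:  (seq_len, dim) continuous embeddings of the input text.
    embed_matrix: (vocab_size, dim) token embedding table of the model.
    loss_fn(embeds) is a placeholder returning the scalar attack loss L_t.
    Nearest-neighbor projection replaces the beam-search refinement described
    above, to keep the sketch short.
    """
    delta = torch.zeros_like(text_embeds, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(text_embeds + delta)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()        # ascend the attack loss
            delta.clamp_(-epsilon, epsilon)     # keep the shift within the l_inf budget
    adv_embeds = (text_embeds + delta).detach()
    # Discretize: per position, pick the vocabulary token whose embedding is closest.
    sims = F.normalize(adv_embeds, dim=-1) @ F.normalize(embed_matrix, dim=-1).T
    adv_token_ids = sims.argmax(dim=-1)         # (seq_len,)
    return adv_token_ids
```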

The attacks are compositional because they optimize a joint objective spanning both branches and their interaction:

$$L_{\text{total}} = L_v(I_{\text{adv}}, y) + \lambda L_t(T_{\text{adv}}, y) + \mu L_{\text{cross}}(I_{\text{adv}}, T_{\text{adv}})$$

where $L_v$ and $L_t$ are the vision and text losses, $L_{\text{cross}}$ explicitly penalizes misalignment between the modalities, and $\lambda, \mu$ balance the terms (Xu et al., 2017).
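Putting the pieces together, one joint gradient step on $L_{\text{total}}$ might look like the following sketch; the `vision_loss`, `text_loss`, and `cross_loss` hooks are hypothetical stand-ins for the model-specific loss terms, not an API from the paper.

```python
import torch

def joint_attack_step(losses, image_adv, text_delta, text_embeds, target,
                      alpha_img=2/255, alpha_txt=0.02, lam=1.0, mu=1.0):
    """One gradient step on L_total = L_v + lam * L_t + mu * L_cross (sketch).

    losses is a hypothetical object exposing vision_loss, text_loss, and
    cross_loss callables that stand in for the model-specific terms.
    """
    image_adv = image_adv.detach().requires_grad_(True)
    text_delta = text_delta.detach().requires_grad_(True)
    text_adv = text_embeds + text_delta
    total = (losses.vision_loss(image_adv, target)
             + lam * losses.text_loss(text_adv, target)
             + mu * losses.cross_loss(image_adv, text_adv))
    g_img, g_txt = torch.autograd.grad(total, (image_adv, text_delta))
    with torch.no_grad():
        # Sign-ascent on both modalities; projection onto the respective
        # perturbation budgets would follow as in the single-modality updates.
        image_adv = image_adv + alpha_img * g_img.sign()
        text_delta = text_delta + alpha_txt * g_txt.sign()
    return image_adv.detach(), text_delta.detach()
```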

This approach directly attacks the cross-modal compositional functions central to VLMs and is empirically shown to be generalizable across models and tasks.

2. Model Vulnerabilities: Attention, Localization, and Compositionality

VLMs are shown to be acutely vulnerable in several architectural modules:

  • Attention Mechanisms: Small input perturbations can cause large changes in attention distributions due to the exponential sensitivity of softmax operations, misdirecting the model’s focus away from the true objects or phrases and producing critical misinterpretations. For example, an adversarial perturbation can shift the model’s attention from a "red ball" to a "blue table", yielding a wrong answer (Xu et al., 2017); a numerical illustration of this sensitivity follows the list.
  • Localization and Dense Captioning: Subtle manipulations of the image shift bounding box predictions or region-of-interest features, leading to localization errors and incorrect object–attribute pairings in captions ("cat under a table" becoming "cat near a table") (Xu et al., 2017).
  • Compositional Structures: VLMs excel by leveraging language’s compositionality, mapping objects, attributes, and relations from the image to text. Attacks that break these compositional alignments cause the model to omit or misstate relationships, exposing systemic weaknesses in reasoning and fusion modules.
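The softmax sensitivity noted in the first bullet can be illustrated numerically: a modest change in pre-softmax attention logits is enough to move most of the attention mass to a different region. The values below are purely illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative only: a modest shift in pre-softmax attention logits
# moves most of the attention mass from one region to another.
logits_clean = torch.tensor([2.0, 1.0, 0.5])         # region 0 ("red ball") dominates
attn_clean = F.softmax(logits_clean, dim=0)          # ~[0.63, 0.23, 0.14]

logits_perturbed = torch.tensor([1.0, 2.2, 0.5])     # small perturbation to the logits
attn_perturbed = F.softmax(logits_perturbed, dim=0)  # ~[0.20, 0.67, 0.12]

print(attn_clean)
print(attn_perturbed)
```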

Case studies in VQA and dense captioning highlight that minimal, imperceptible input perturbations can propagate through internal representations and result in final outputs that are both plausible and factually incorrect, underscoring the non-triviality of defending against such attacks.

3. Quantitative Metrics for Evaluating Attack Success

Several rigorous metrics have been developed to measure the efficacy and systematic impact of compositional adversarial attacks:

| Metric Name | Evaluates | Reported Result |
|---|---|---|
| Attack Success Rate (ASR) | Fraction of adversarial inputs causing wrong output | >90% (Xu et al., 2017) |
| Attention Distribution Δ | Shift in attention heatmaps pre/post attack | Large shifts observed |
| Localization Error (IoU) | Perturbed vs. original bounding boxes | Significant degradation |
| Compositional Consistency | Preservation of object–attribute–relation pairs | Often much reduced |

The paper (Xu et al., 2017) reports an ASR exceeding 90% across benchmark VQA and captioning tasks, indicating that the vast majority of adversarial examples succeed in causing misclassification or miscaptioning. Additional metrics such as attention heatmap shifts and intersection-over-union (IoU) measure the subtlety and semantic fidelity of the perturbation, and a newly proposed compositional consistency score quantifies the alignment of object–attribute relationships between the clean and adversarial outputs.
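For concreteness, the two most common of these metrics can be computed as in the sketch below. Note that ASR is often conditioned on inputs the model originally answered correctly; the exact definition used in the paper may differ.

```python
def attack_success_rate(clean_preds, adv_preds, labels):
    """ASR: share of originally correct inputs flipped to a wrong answer."""
    initially_correct = sum(1 for c, y in zip(clean_preds, labels) if c == y)
    flipped = sum(1 for c, a, y in zip(clean_preds, adv_preds, labels)
                  if c == y and a != y)
    return flipped / max(initially_correct, 1)

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) bounding boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0
```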

4. Impact on Downstream Tasks and Systemic Risks

The impact of compositional adversarial attacks extends broadly across VLM-powered applications, especially dense captioning and VQA:

  • VQA: Adversaries can force the model to focus on irrelevant image regions or to misinterpret the input question, yielding answers inconsistent with the visual evidence (e.g., answering “blue” when the image shows a red ball).
  • Dense Captioning: Adversarial examples can induce captions that not only omit key objects or relations but can entirely swap spatial or attribute associations, fundamentally degrading output utility and potentially resulting in dangerous miscommunication in critical settings (e.g., autonomous vehicles, robotics).

Visualizations such as attention heatmaps and “before/after” captions quantitatively and qualitatively illustrate that attacks can misdirect internal focus or force the omission of core content while maintaining surface fluency.

5. Implications for Model Development and Future Research

The demonstrated vulnerabilities in attention, localization, and compositional reasoning highlight several key directions for future work:

  • Adversarial Training: Robustification efforts must consider not just unimodal perturbations but also multimodal and cross-modal attacks. Training under worst-case perturbations for both image and text modalities, as well as their interaction, may be essential to close the robustness gap.
  • Attention Map Regularization: Stability constraints on attention maps, requiring consistent focus on key objects and relationships under input perturbation, could mitigate mislocalized and misdirected responses (a minimal sketch of such a consistency loss follows this list).
  • Hybrid Verification and Rule-based Modules: Integration of discriminative models with structured reasoning modules (e.g., knowledge graphs or rule-checkers) may catch compositional inconsistencies even when both modalities are individually plausible but misaligned.
  • Evaluation and Benchmarks: The establishment of dedicated compositional adversarial benchmarks will incentivize the design of inherently robust architectures and offer standardization for robustness assessment.
  • Defense Paradigm Shift: Single-modal defense strategies do not address the interdependent vulnerabilities in VLMs. Future models must adopt integrated defense paradigms that recognize the interplay between vision and language fusion.
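As referenced in the attention-map regularization item above, such a consistency term could be as simple as a divergence penalty between clean and perturbed attention distributions. The following is an illustrative sketch, not a construction from the paper.

```python
import torch
import torch.nn.functional as F

def attention_consistency_loss(attn_clean, attn_perturbed, eps=1e-8):
    """KL divergence between clean and perturbed attention distributions.

    attn_clean, attn_perturbed: attention weights over the same image regions
    (each summing to 1). Adding this term to the training objective penalizes
    perturbations that redirect attention, encouraging stable focus.
    Illustrative regularizer only.
    """
    return F.kl_div((attn_perturbed + eps).log(), attn_clean, reduction="sum")
```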

These directions map the landscape for the development of next-generation robust VLMs, emphasizing the interdependence of modalities and architectural modules.

6. Summary

Compositional adversarial attacks on VLMs represent a critical threat surface arising from the unique interplay and fusion of visual and linguistic modalities. By attacking the architecture’s internal mechanisms—attention allocation, localization, and semantic composition—these attacks cause both subtle and severe output errors while evading conventional defense strategies. Quantitative evidence (>90% ASR) underscores the urgency for advancing adversarially robust multimodal architectures. Addressing these vulnerabilities requires fundamentally new training paradigms, interpretability tools, and evaluation regimes that view multimodal robustness as a first-class design property (Xu et al., 2017).
