
Adversarial Robustness in Visual Grounding

Updated 23 September 2025
  • Adversarial robustness for visual grounding is the ability of multimodal systems to reliably map language descriptions to image regions despite subtle adversarial perturbations.
  • Research focuses on threat models and countermeasures, including causal interventions and decoupled feature masking, to mitigate vulnerabilities in both image and text inputs.
  • Evaluation protocols using metrics like ASR, IoU, and MMI drive advances in architectural innovations and adversarial pre-training, enhancing overall model resilience.

Adversarial robustness for visual grounding refers to the ability of models that map natural language descriptions to specific regions in images to resist imperceptible or malicious perturbations in either input modality. Such robustness is of critical importance for tasks like image captioning, referring expression comprehension, and multimodal entity linking, where outputs are directly affected by subtle manipulation of pixels or language. Research in this field investigates both the vulnerabilities of current visual grounding architectures and algorithmic interventions to improve resilience against attacks. The following sections detail the foundational concepts, principal attack and defense methodologies, evaluation protocols, and implications across classic and modern model families.

1. Threat Models and Adversarial Attack Paradigms

The space of adversarial threats to visual grounding includes both direct image-level and text-level attacks, as well as more sophisticated multimodal manipulations and semantic backdoors. Classic approaches, such as the Show-and-Fool algorithm (Chen et al., 2017), formulate the attack as a constrained optimization that minimizes a loss steering the output (caption or region) towards a target while bounding the $\ell_2$ distortion:

$$\min_{\delta} \; c \cdot \text{loss}(I + \delta) + \|\delta\|_2^2 \quad \text{subject to } I + \delta \in [-1,1]^n$$

with unconstrained variants using change-of-variable tricks for numerical stability.
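To make the formulation concrete, the following is a minimal PyTorch sketch of this penalty-based attack, assuming a model that exposes a hypothetical `targeted_caption_loss(image, target)` helper returning a scalar loss that is low when the attacker's target caption (or region) is produced; the box constraint is handled by clamping here rather than the paper's change-of-variable reparameterization.

```python
import torch

def show_and_fool_attack(model, image, target_caption, c=1.0, steps=1000, lr=1e-2):
    """Penalty-based L2 attack in the spirit of Show-and-Fool (Chen et al., 2017).

    `model.targeted_caption_loss` is a hypothetical helper returning a scalar
    loss that is low when the model emits the attacker's target caption.
    Pixel values are assumed to lie in [-1, 1].
    """
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv = torch.clamp(image + delta, -1.0, 1.0)      # keep the adversarial image valid
        loss = c * model.targeted_caption_loss(adv, target_caption) + delta.pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.clamp(image + delta.detach(), -1.0, 1.0)
```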

For multimodal LLMs (MLLMs), attack paradigms now encompass:

  • Untargeted attacks that maximize deviation between the visual embeddings of clean and adversarial images, or minimize the log-likelihood of generating the true bounding box (Gao et al., 16 May 2024); see the sketch after this list.
  • Exclusive targeted attacks which force all referring queries to map to a single, arbitrarily chosen region.
  • Permuted targeted attacks that cyclically permute region outputs among objects within a scene.
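A minimal PGD-style sketch of the untargeted variant follows, assuming access to the MLLM's vision encoder as a callable `vision_encoder`; the $\ell_\infty$ budget, step size, and cosine-distance objective are illustrative choices rather than the exact settings of the cited attack.

```python
import torch
import torch.nn.functional as F

def untargeted_embedding_attack(vision_encoder, image, eps=8/255, alpha=1/255, steps=20):
    """Maximize the distance between clean and adversarial visual embeddings (L_inf PGD)."""
    with torch.no_grad():
        clean_emb = vision_encoder(image)                  # reference embedding of the clean image
    # random start inside the eps-ball avoids the zero gradient at adv == image
    adv = (image + torch.empty_like(image).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        adv.requires_grad_(True)
        dist = 1.0 - F.cosine_similarity(vision_encoder(adv).flatten(1),
                                         clean_emb.flatten(1)).mean()
        grad = torch.autograd.grad(dist, adv)[0]
        adv = adv.detach() + alpha * grad.sign()           # ascend on embedding deviation
        adv = image + torch.clamp(adv - image, -eps, eps)  # project back to the L_inf ball
        adv = adv.clamp(0.0, 1.0)
    return adv.detach()
```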

Backdoor attacks such as IAG (Li et al., 13 Aug 2025) leverage text-aware trigger generators (a text-conditional U-Net) to implant imperceptible, semantically meaningful triggers, forcing models to ground any query on a specific target with high success rates (ASR@0.5 reaching over 65%).

Textual adversarial samples, including noun, numeral, and relation substitutions (Shi et al., 2018) or property-reducing expressions (Chang et al., 2 Mar 2024), are viable threat vectors. Many current visual grounding models exhibit surprising vulnerability even in black-box settings, with metric drops under attack (e.g., up to 21.4% MMI on RefCOCO/+/g for the OFA-VG model under PEELING).

2. Defense Methodologies and Regularization Strategies

Adversarial defense research targets either feature-level or architectural resilience.

Causal Intervention

The CiiV regularization (Tang et al., 2021) adopts a causal inference viewpoint, suppressing the influence of spurious confounders by augmenting images with retinotopic masks and enforcing consistency of model outputs under varying spatial samplings:

$$L_{\text{CiiV}} = \sum_{r_i \ne r_j} \left\| \alpha_{r_j}\, Y[X = x_{r_i}] - \alpha_{r_i}\, Y[X = x_{r_j}] \right\|$$

where $\alpha_{r_j}$ denotes normalized spatial coverage.
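A minimal sketch of this consistency term, assuming a classifier `model`, a list of retinotopic-style binary `masks`, and per-mask coverage weights `alphas`; mask construction and the sampling schedule in the paper are not reproduced here.

```python
import torch

def ciiv_consistency_loss(model, image, masks, alphas):
    """Consistency of predictions across differently masked views (CiiV-style).

    masks:  binary masks broadcastable to `image`
    alphas: scalars, the normalized spatial coverage of each mask
    """
    outputs = [model(image * m) for m in masks]            # Y[X = x_{r_i}] for each sampling
    loss = 0.0
    for i in range(len(outputs)):
        for j in range(len(outputs)):
            if i == j:
                continue
            # || alpha_{r_j} * Y(x_{r_i}) - alpha_{r_i} * Y(x_{r_j}) ||
            loss = loss + torch.norm(alphas[j] * outputs[i] - alphas[i] * outputs[j])
    return loss
```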

Decoupled Feature Masking

DFM blocks (Liu et al., 16 Jun 2024) disentangle discriminative and non-discriminative visual features with diverse binary masking strategies within network layers, optimizing intra-class diversity and inter-class discriminability:

$$\hat{f}_i = c_{1,i} \odot M_1 + c_{2,i} \odot M_2$$

This approach disrupts adversarial noise, enhancing the separation and robustness of learned representations.
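A minimal sketch of a masking block in this spirit, assuming the two components come from lightweight 1×1-convolution branches and the binary masks are Bernoulli-sampled; the paper's specific mask strategies and placement in the network are simplified.

```python
import torch
import torch.nn as nn

class DFMBlock(nn.Module):
    """Decoupled feature masking sketch: two branches produce components c1 and c2,
    recombined under complementary binary masks as f_hat = c1 * M1 + c2 * M2.
    The 1x1-conv branches and Bernoulli masks are illustrative assumptions."""

    def __init__(self, channels, keep_prob=0.5):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.branch2 = nn.Conv2d(channels, channels, kernel_size=1)
        self.keep_prob = keep_prob

    def forward(self, f):
        c1, c2 = self.branch1(f), self.branch2(f)
        m1 = torch.bernoulli(torch.full_like(f, self.keep_prob))  # binary mask M1
        m2 = 1.0 - m1                                              # complementary mask M2
        return c1 * m1 + c2 * m2
```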

Visual Prompting

Visual Prompting (VP) and its class-wise extension (C-AVP) (Chen et al., 2022) learn class-discriminative prompt patterns for input transformation, substantially increasing robust accuracy (2× improvement over vanilla VP) and inference efficiency (42× speedup compared to classical defenses).
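A minimal sketch of class-wise prompted inference, assuming additive per-class prompt patterns (`class_prompts`) that have already been learned; the prompt parameterization and scoring rule are illustrative rather than the cited method's exact design.

```python
import torch

def cavp_predict(model, image, class_prompts):
    """Class-wise visual prompting at inference: add each class's learned prompt
    to the input and pick the class whose prompted input scores highest for it.

    class_prompts: tensor of shape (num_classes, C, H, W) of learned prompts.
    """
    scores = []
    for k, prompt in enumerate(class_prompts):
        prompted = (image + prompt).clamp(0.0, 1.0)    # prompted input for class k
        logits = model(prompted)
        scores.append(logits[:, k])                    # confidence for class k
    return torch.stack(scores, dim=1).argmax(dim=1)    # predicted class per example
```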

Adversarial Pre-training and Instruction Tuning

Double Visual Defense (Wang et al., 16 Jan 2025) executes adversarially constrained CLIP pre-training on large-scale web data, followed by adversarial visual instruction tuning. This combination yields robust models such as $\Delta$CLIP and $\Delta^2$LLaVA, which outperform previous methods by 20–30% in adversarial settings while retaining high zero-shot performance.
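A minimal sketch of one adversarially constrained contrastive pre-training step in this spirit, assuming paired image and text encoders and a symmetric InfoNCE loss; the perturbation budget, number of inner PGD steps, and optimizer setup are illustrative assumptions rather than the cited recipe.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def adversarial_clip_step(image_encoder, text_encoder, images, token_ids,
                          optimizer, eps=4/255, alpha=1/255, pgd_steps=3):
    """One adversarial pre-training step: craft L_inf perturbations that maximize
    the contrastive loss, then update both encoders on the perturbed images."""
    txt_emb = text_encoder(token_ids).detach()
    adv = images.clone().detach()
    for _ in range(pgd_steps):                               # inner maximization (PGD)
        adv.requires_grad_(True)
        loss = clip_contrastive_loss(image_encoder(adv), txt_emb)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = images + torch.clamp(adv - images, -eps, eps)
        adv = adv.clamp(0.0, 1.0)
    optimizer.zero_grad()                                    # outer minimization
    clip_contrastive_loss(image_encoder(adv), text_encoder(token_ids)).backward()
    optimizer.step()
```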

Hierarchical Modulation and Context Disentanglement

Methods like HiVG (Xiao et al., 20 Apr 2024) and TransCP (Tang et al., 2023) employ hierarchical cross-modal bridging and disentangling of referential/contextual features, as well as prototype bank inheritance to anchor predictions, mitigating sensitivity to adversarial noise and supporting open-vocabulary robustness.

3. Evaluation Protocols and Metrics

Robustness assessment is performed via a suite of metrics:

  • Attack Success Rate (ASR): Percentage of instances for which the attack causes incorrect grounding.
  • Intersection over Union (IoU@0.5): Quantifies bounding box overlap; for untargeted attacks, lower IoU with the ground-truth box indicates attack success, while for targeted attacks, higher IoU with the attacker's chosen box does.
  • MultiModal Impact (MMI): defined as $(A_o - A_a)/A_o$, where $A_o$ and $A_a$ denote accuracy on original and adversarial inputs; it measures the relative accuracy drop under attack.
  • False Alarm Discovery Rate ($R_{fad}$) and Mixed Correct Rate ($R_{mix}$): Newly proposed for robust VG datasets (Li et al., 2023) to assess rejection or acceptance of false-alarm examples.
  • Caption-quality metrics (BLEU, ROUGE, METEOR) for captioning tasks (Chen et al., 2017).
  • Entity linking accuracy for MEL models (Wang et al., 21 Aug 2025).

| Attack Type | Metric(s) | Typical Impact |
|---|---|---|
| Image PGD (8/255) | Linking accuracy | 15–30 %-pt drop |
| Textual PEELING | MMI, IoU, ATCR | ~21.4% MMI, ATCR ~90% |
| Backdoor (IAG) | ASR@0.5 | ≥65% (InternVL-2.5-8B) |

Table: Representative evaluation metrics and observed adversarial impact
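A minimal sketch of the box-level metrics above (IoU@0.5 accuracy, ASR, and MMI) for axis-aligned boxes in (x1, y1, x2, y2) format; exact thresholds and success criteria vary across the cited papers.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def grounding_accuracy(preds, gts, thresh=0.5):
    """Fraction of predictions whose IoU with the ground truth exceeds `thresh`."""
    return sum(box_iou(p, g) > thresh for p, g in zip(preds, gts)) / len(gts)

def attack_success_rate(clean_preds, adv_preds, gts, thresh=0.5):
    """ASR: share of originally correct predictions that become incorrect under attack."""
    correct = [box_iou(p, g) > thresh for p, g in zip(clean_preds, gts)]
    flipped = [c and box_iou(a, g) <= thresh
               for c, a, g in zip(correct, adv_preds, gts)]
    return sum(flipped) / max(sum(correct), 1)

def multimodal_impact(acc_clean, acc_adv):
    """MMI = (A_o - A_a) / A_o, the relative accuracy drop under attack."""
    return (acc_clean - acc_adv) / max(acc_clean, 1e-9)
```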

Transferability is routinely assessed by evaluating attacks on multiple architectures and datasets, with high transfer rates observed for both adversarial examples (Chen et al., 2017) and backdoors (Li et al., 13 Aug 2025).

4. Architectural Innovations for Robust Grounding

Recent work converges towards modular and hierarchical strategies:

  • Hierarchical Contextual Grounding LVLM (HCG-LVLM) (Guo et al., 23 Aug 2025): Adopts a two-layer architecture with coarse global perception and fine-grained local grounding. The semantic consistency validator computes similarity between local region features and textual embeddings, penalizing hallucinations and enforcing robust visual-language alignment (a sketch of this validation step follows the list).
  • Iterative Robust Visual Grounding (IR-VG) (Li et al., 2023): Uses multi-level vision-language fusion and masked centerpoint supervision to robustly localize under adversarial, noisy, or inaccurate queries, improving $R_{fad}$ (25% gain) and $R_{mix}$ (10% gain) over SOTA baselines.
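A minimal sketch of such a semantic consistency check, assuming region features and the expression embedding live in a shared space and using an illustrative cosine-similarity threshold; the actual validator in HCG-LVLM may differ.

```python
import torch
import torch.nn.functional as F

def validate_grounding(region_feats, text_emb, boxes, sim_thresh=0.3):
    """Keep only candidate regions whose features are semantically consistent
    with the referring expression; reject the rest as likely hallucinations.

    region_feats: (N, D) features of candidate regions
    text_emb:     (D,)  embedding of the referring expression
    boxes:        list of N candidate boxes aligned with region_feats
    """
    sims = F.cosine_similarity(region_feats, text_emb.unsqueeze(0), dim=-1)  # (N,)
    keep = sims > sim_thresh
    if not keep.any():
        return None, sims                     # no region is consistent: abstain
    best = int(sims.masked_fill(~keep, float("-inf")).argmax())
    return boxes[best], sims
```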

Contrastive and multi-task learning are viable options for enforcing sensitivity to text structure and spatial reasoning, with documented improvements for ViLBERT (Akula et al., 2020).

5. Adversarial Robustness in Application Domains

  • GUI Grounding: GUI agents exhibit susceptibility to both natural noise and adversarial perturbations with significant performance degradation in low-resolution and complex interface settings (Zhao et al., 7 Apr 2025).
  • Entity Linking: MEL models, when attacked visually, see accuracy drops up to 30%, counteracted only partially by context augmentation and retrieval-augmented linking (Wang et al., 21 Aug 2025).
  • VLM Backdoors: Adaptive triggers allow for high attack success with negligible clean accuracy impact, complicating the defense landscape (IAG (Li et al., 13 Aug 2025)).
  • Multimodal LLMs: MLLMs (e.g., MiniGPT-v2) are highly vulnerable to both untargeted and targeted bounding box attacks, indicating the necessity for integrated visual-linguistic adversarial training (Gao et al., 16 May 2024).

6. Implications and Future Directions

Current research demonstrates that visual grounding systems remain fundamentally vulnerable to multimodal adversarial perturbations, backdoors, and context-aware attacks. The most effective remedies blend robust feature design (CiiV, DFM), hierarchical fusion (HiVG, HCG-LVLM), and large-scale adversarial pre-training (ΔCLIP). Dynamic context enrichment and retrieval, as in LLM-RetLink (Wang et al., 21 Aug 2025), present a promising mitigation strategy for entity linking and richer grounding tasks.

Open questions persist regarding:

  • Generalization to unseen attack types and zero-shot scenarios.
  • Joint adversarial training of both vision and language components.
  • Certification and theoretical guarantees for robustness (cf. decoupled feature masking).
  • Defense against adaptive, input-aware backdoors in open-vocabulary and generative models.

A plausible implication is that robust visual grounding will increasingly rely on hybrid strategies encompassing hierarchical architectural design, multimodal regularization, context retrieval, and continual adversarial evaluation. As models evolve toward universal multimodal agents, adversarial robustness must remain a first-class objective in both training and deployment.
