Function-word De-Attention in VLMs
- Function-word De-Attention (FDA) is a mechanism that subtracts function-word attention in vision-language models to reduce adversarial distractions and improve cross-modal alignment.
- FDA operates by applying a differential subtraction at the attention head level using a 93-word function-word dictionary, integrated seamlessly into selected fusion encoder layers.
- The approach substantially reduces attack success rates on tasks such as image-text retrieval and visual grounding, with relative ASR reductions of up to 93% at negligible clean-accuracy cost.
Function-word De-Attention (FDA) is a mechanism for vision-language models (VLMs) that aims to enhance their robustness under adversarial cross-modal attacks by explicitly mitigating the detrimental influence of function words during cross-attention. FDA operates by differentially subtracting attention attributable to function words at the attention-head level, yielding improved alignment between modalities and significantly reducing model vulnerability with negligible impact on clean-data accuracy (Tian et al., 8 Dec 2025).
1. Motivation and Empirical Foundations
VLMs, when defended by standard adversarial training, exhibit a pronounced trade-off between robustness and accuracy under image perturbations. Empirical analysis revealed that in targeted Projected Gradient Descent (PGD) attacks on retrieval tasks, 80.3% of perturbed images caused cross-modal attention to focus more on function words (such as “the,” “is,” “and”) than on content words—a behavior absent in clean conditions. Grad-CAM visualizations confirmed that adversarial examples distract cross-modal attention via function words. Notably, removal of function words from textual input largely alleviates this effect without harming clean accuracy. This evidence motivates the hypothesis that function words act as distractions in cross-modal alignment under attack scenarios, contributing to model vulnerability.
2. Formal Specification and Mechanism
FDA formalizes function-word distraction subtraction at the attention-head level within the fusion encoder. Consider an image $I$ and a text $T$, with pretrained encoders $E_v$ (visual) and $E_t$ (textual) producing features $V$ and $W$. A standard cross-attention for layer $l$, head $h$ is defined as

$$A^{l,h} = \mathrm{softmax}\!\left(\frac{Q^{l,h}\,(K^{l,h})^{\top}}{\sqrt{d_h}}\right), \qquad O^{l,h} = A^{l,h} V^{l,h},$$

where $Q^{l,h}$ is projected from the textual features, $K^{l,h}$ and $V^{l,h}$ from the visual features, and $d_h$ is the per-head dimension.

FDA introduces a function-word mask $m \in \{0,1\}^{|T|}$, extracting function-word token features $W_f = m \odot W$ with query projections $Q_f^{l,h}$, and computes function-word attention scores

$$S_f^{l,h} = \frac{Q_f^{l,h}\,(K^{l,h})^{\top}}{\sqrt{d_h}}.$$

Two distraction maps are formed by softmax normalization over the visual and textual token dimensions:

$$D_v^{l,h} = \mathrm{softmax}_{\mathrm{visual}}\!\left(S_f^{l,h}\right), \qquad D_t^{l,h} = \mathrm{softmax}_{\mathrm{textual}}\!\left(S_f^{l,h}\right).$$

A scalar gate $\lambda$ (typically set to $1$) determines the subtraction scale. FDA then defines the de-attended output

$$\tilde{O}^{l,h} = \left(A^{l,h} - \lambda\,\bigl(D_v^{l,h} + D_t^{l,h}\bigr)\right) V^{l,h},$$

where the distraction maps are zero-padded back to the full set of textual positions. This output replaces the standard attention output for downstream computation, and multi-head outputs are concatenated as usual.
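To make the mechanism concrete, the following is a minimal PyTorch sketch of the de-attention step for a single head, assuming text queries attending to visual keys/values; the function name and tensor layout are illustrative assumptions, not the reference implementation.

```python
import torch

def fda_head(Q, K, V, fw_mask, lam=1.0):
    """Function-word de-attention for one cross-attention head (sketch).

    Q:       (n_t, d) query projections of the text tokens
    K, V:    (n_v, d) key/value projections of the visual tokens
    fw_mask: (n_t,) bool, True where a text token is a function word
    lam:     scalar gate lambda controlling the subtraction scale
    """
    d = Q.shape[-1]
    scores = Q @ K.T / d**0.5              # (n_t, n_v) raw scores
    attn = scores.softmax(dim=-1)          # standard attention map A

    # Scores restricted to function-word query rows (others suppressed).
    neg_inf = torch.finfo(scores.dtype).min
    fw = fw_mask[:, None].float()
    fw_scores = scores.masked_fill(~fw_mask[:, None], neg_inf)

    D_v = fw_scores.softmax(dim=-1) * fw   # normalized over visual tokens
    D_t = fw_scores.softmax(dim=0) * fw    # normalized over textual tokens

    # Differential subtraction of the two distraction maps, scaled by the gate.
    de_attn = attn - lam * (D_v + D_t)
    return de_attn @ V                     # de-attended output
```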
3. Integration with Vision-Language Architectures
FDA is applied within selected fusion-encoder layers and attention heads. The recommended configuration applies FDA to the first two fusion layers ($l = 0, 1$) and the first half of the attention heads ($h = 0$–$5$). Implementation is parameter-free beyond the original attention module; the gate $\lambda$ may be left fixed. The function-word dictionary consists of 93 selected stop-words. FDA is integrated at inference or finetuning time without architectural modification, and finetuning is performed for each downstream task (typically for 10 epochs) with only the last epoch used for evaluation. FDA is evaluated on ALBEF, TCL, and BLIP architectures employing BERT as the text encoder and ViT-B/32 for vision.
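As an illustration, the recommended placement can be reduced to a simple predicate deciding where the de-attended output replaces the standard one. This is a minimal sketch with hypothetical names, assuming a 12-head fusion encoder:

```python
# Recommended placement: first two fusion layers, first half of 12 heads.
FDA_LAYERS = {0, 1}
FDA_HEADS = set(range(6))   # heads 0-5
FDA_GATE = 1.0              # fixed scalar gate lambda

def apply_fda(layer_idx: int, head_idx: int) -> bool:
    """True if this (layer, head) should use FDA instead of standard attention."""
    return layer_idx in FDA_LAYERS and head_idx in FDA_HEADS
```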
4. Experimental Protocol and Benchmarks
FDA is benchmarked on two core downstream tasks: text-to-image and image-to-text retrieval (Flickr30k, MSCOCO), and visual grounding (RefCOCO+). Models include ALBEF (14M), TCL (14M), and BLIP (124M). The experimental setup encompasses white-box PGD, AutoAttack (APGD), and adaptive Masked APGD (MAPGD, which masks function words to circumvent FDA) at perturbation levels $\epsilon = 2/255$ and $4/255$. Baselines include TeCoA and FARE (adversarially fine-tuned CLIP-style models). The principal evaluation metric is Attack Success Rate (ASR), reported as the relative improvement $\Delta_{\mathrm{ASR}} = (\mathrm{ASR}_{\text{base}} - \mathrm{ASR}_{\text{FDA}}) / \mathrm{ASR}_{\text{base}}$.
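A brief worked example of the metric, using hypothetical raw values (the paper's absolute ASR numbers are not reproduced here): a baseline ASR of 60.0% reduced to 46.6% under FDA gives

$$\Delta_{\mathrm{ASR}} = \frac{60.0\% - 46.6\%}{60.0\%} \approx 22.3\%,$$

which matches the scale of the relative drops tabulated below.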
Results summary:
| Task / Model (Dataset) | Perturbation ($\epsilon$) | Relative ASR Drop (FDA) | Clean Accuracy Change |
|---|---|---|---|
| Retrieval / ALBEF (Flickr30k) | $2/255$ | 22.3% | -0.3% |
| Retrieval / TCL (Flickr30k) | $2/255$ | 14.3% | -0.5% |
| Retrieval / BLIP (Flickr30k) | $2/255$ | 51.6% | -0.7% |
| Retrieval / ALBEF (MSCOCO) | $2/255$ | 9.3% | +0.1% |
| Visual Grounding / ALBEF (RefCOCO+) | $2/255$ | 93.2% | +0.3% |
FDA consistently lowers ASR by substantial margins with only minor or positive effects on clean accuracy. Similar trends persist under stronger attacks ($4/255$) and untargeted settings, and plugging FDA into baseline robust models yields additional incremental gains.
5. Robustness, Scalability, and Generalization
FDA exhibits favorable scaling characteristics: larger backbones such as BLIP (124M) show greater absolute robustness improvements (~54% ASR drop) than the smaller ALBEF and TCL models (~15%). FDA is model-agnostic and composes with existing robustification methods such as TeCoA and FARE for cumulative benefits. Zero-shot deployment without further finetuning still yields nontrivial improvements in retrieval (up to +0.47% R@1/5) and grounding (+0.22% accuracy).
6. Ablation Analyses and Design Insights
Multiple ablation studies dissect the contributions of FDA’s components:
- Masking vs. de-attention: Removing function words outright causes a 3% drop in clean accuracy but only a 1.6% gain in robustness (ASR drop); FDA achieves a higher robustness gain (23% ASR drop) with negligible clean loss.
- De-attention scope: Restricting de-attention to determiners or adjectives yields 9% and 15% ASR drops, respectively, falling short of applying FDA across the full function-word dictionary.
- Token selection: Robustness gain is correlated with the fraction of function words included; the default 93-word dictionary is as effective as a larger 208-word list (a mask-construction sketch follows this list).
- Encoder/application site: FDA performs best when integrated into the fusion encoder only, outperforming application in the text encoder or jointly.
- Layer/head localization: Restricting FDA to shallower layers and select heads is optimal for retrieval and grounding; applying it to all layers is most generalizable, especially for zero-shot use.
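Below is a minimal sketch of the token-selection step referenced above, assuming a BERT WordPiece tokenizer (consistent with the evaluated backbones); the dictionary shown is a small illustrative subset, not the paper's 93-word list.

```python
import torch
from transformers import BertTokenizerFast

# Illustrative subset of function words; the paper uses a 93-word dictionary.
FUNCTION_WORDS = {
    "the", "a", "an", "is", "are", "was", "and", "or", "of",
    "to", "in", "on", "at", "with", "for", "that", "this", "it",
}

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def function_word_mask(caption: str) -> torch.Tensor:
    """Boolean mask over the caption's WordPiece tokens, True at function words."""
    ids = tokenizer(caption)["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(ids)
    # Special tokens ([CLS]/[SEP]) and "##" continuations never match
    # the dictionary, so they remain unmasked.
    return torch.tensor([tok in FUNCTION_WORDS for tok in tokens])
```

For a caption such as "a dog is chasing the ball", the mask flags "a", "is", and "the" while leaving the content tokens unmasked.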
7. Limitations and Future Directions
FDA’s demonstrated efficacy is currently limited to fusion-encoder backbones (ALBEF, TCL, BLIP) and has not been tested on CLIP-style models. The control gate ($\lambda$) remains fixed, and finer, dynamically learned gating mechanisms are unexplored. Potential adaptation to large-scale LLM-style VLMs, e.g., via LoRA, represents an open avenue. FDA requires no adversarial data or extra learnable parameters, is fully white-box compatible, and serves as the first technique to refine cross-modal attention by explicitly subtracting function-word distractions, delivering substantial free robustness gains with near-zero accuracy cost (Tian et al., 8 Dec 2025).