Entropy-bank Guided Adversarial Attacks
- EGA is a targeted adversarial method that exploits token-level uncertainty in vision-language models to efficiently alter output trajectories.
- It selects approximately 20% of high-entropy tokens and uses a reusable entropy bank to improve attack transferability across different model architectures.
- Empirical evaluations show over 93% attack success and a significant increase in harmful content conversion compared to global perturbation methods.
Entropy-bank Guided Adversarial Attacks (EGA) refer to a targeted adversarial methodology designed to exploit uncertainty-driven vulnerabilities within autoregressive vision-language models (VLMs). EGA leverages the observation that only a small subset (approximately 20%) of token prediction steps—marked by high token-level entropy—primarily governs output trajectories. By concentrating perturbations at these “critical decision points,” EGA enables both model-specific and cross-model semantic degradation with greater efficiency and transferability than prior global or random attacks. The approach is distinguished by the construction of a reusable entropy bank, allowing for selective, high-impact targeting of tokens that are empirically most susceptible to adversarial manipulation (He et al., 26 Dec 2025).
1. Entropy in Vision-Language Model Decoding
In the context of EGA, entropy serves as a measure of prediction uncertainty at each step of autoregressive VLM generation. For a VLM that autoregressively produces a token sequence $y = (y_1, \dots, y_T)$ conditioned on an image $x$, the token-level entropy at time $t$ is given by

$$H_t = -\sum_{v \in \mathcal{V}} p_\theta(v \mid x, y_{<t}) \log p_\theta(v \mid x, y_{<t}),$$

where $p_\theta(v \mid x, y_{<t})$ is the model’s predicted probability for token $v$ at step $t$ and $\mathcal{V}$ is the vocabulary. A single teacher-forced pass on clean images yields an entropy profile $(H_1, \dots, H_T)$, quantifying generative instability at each position.
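The entropy profile can be computed in a single pass over the per-step logits. The following is a minimal NumPy sketch; the function name and toy logits are illustrative, not the paper’s code:

```python
import numpy as np

def entropy_profile(logits):
    """Token-level entropy H_t for each step of a teacher-forced pass.

    logits: array of shape (T, |V|) -- per-step scores over the vocabulary.
    Returns an array of shape (T,) with H_t = -sum_v p_t(v) log p_t(v).
    """
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # Entropy in nats; the tiny epsilon guards against log(0).
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

# Toy check: a confident step has low entropy, a uniform step high entropy.
logits = np.array([[10.0, 0.0, 0.0, 0.0],   # near-deterministic prediction
                   [0.0, 0.0, 0.0, 0.0]])   # uniform over 4 tokens
H = entropy_profile(logits)
```

The uniform step attains the maximum entropy $\log |\mathcal{V}|$, which is what makes high-$H_t$ positions stand out in the profile.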
2. Selection and Characterization of High-Entropy Tokens
EGA identifies the tokens responsible for the majority of generation “forks” by selecting the top quantile of positions with the highest entropies. Specifically, entropies are ranked in descending order, and a set $S = \{t_{(1)}, \dots, t_{(k)}\}$ is formed, where $k = \lceil \rho T \rceil$ and $\rho$ is the selection ratio (default $\rho = 0.2$). Alternatively, a threshold $\tau$ can define the set $S_\tau = \{t : H_t \geq \tau\}$. Empirical findings establish that this ≈20% subset governs the bulk of generative trajectory changes, thus presenting an optimal attack surface.
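The top-quantile selection reduces to a ranking over the entropy profile. A minimal sketch, with an illustrative function name and toy profile:

```python
import numpy as np

def select_high_entropy(H, rho=0.2):
    """Return the indices of the top-ceil(rho*T) highest-entropy positions."""
    T = len(H)
    k = max(1, int(np.ceil(rho * T)))
    # argsort is ascending, so the last k indices are the k largest entropies.
    return set(np.argsort(H)[-k:].tolist())

# Toy profile over T = 10 positions: ceil(0.2 * 10) = 2 positions selected.
H = np.array([0.1, 2.3, 0.4, 1.9, 0.2, 0.3, 2.1, 0.5, 0.6, 1.5])
S = select_high_entropy(H, rho=0.2)
```

A fixed-threshold variant would instead return `{t for t, h in enumerate(H) if h >= tau}`.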
3. Construction and Integration of the Entropy Bank
EGA constructs a reusable entropy bank to generalize across images and models. This bank comprises the top-$K$ tokens from the vocabulary with the highest empirical “flip-rate,” calculated as

$$\mathrm{flip}(v) = \frac{\#\{\text{occurrences of } v \text{ whose prediction changes under perturbation}\}}{\#\{\text{occurrences of } v\}}.$$

Tokens are ranked by flip-rate, and the bank $B$ consists of the top-$K$ most “flippable.” At attack time, the per-image high-entropy mask $S$ is augmented with positions whose decoded token lies in the bank: $M = S \cup \{t : y_t \in B\}$. This augmentation increases attack effectiveness and enables transferability across VLM architectures.
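The bank construction can be sketched as follows. This is an illustrative reading of the flip-rate definition (counting positions whose decoded token changes between clean and perturbed outputs); `build_entropy_bank` and the alignment scheme are assumptions, not the authors’ released code:

```python
from collections import Counter

def build_entropy_bank(clean_seqs, perturbed_seqs, k=50):
    """Rank vocabulary tokens by empirical flip-rate and return the top-k.

    clean_seqs / perturbed_seqs: aligned token sequences decoded from
    clean and perturbed versions of the same images.
    """
    occ, flips = Counter(), Counter()
    for clean, pert in zip(clean_seqs, perturbed_seqs):
        for c, p in zip(clean, pert):
            occ[c] += 1
            if c != p:          # the prediction at this position "flipped"
                flips[c] += 1
    rate = {v: flips[v] / occ[v] for v in occ}
    # Highest flip-rate first; keep the k most "flippable" tokens.
    return [v for v, _ in sorted(rate.items(), key=lambda kv: -kv[1])[:k]]

# Toy example: "a" and "b" each flip half the time, "c" never flips.
clean = [["a", "b", "c"], ["a", "c", "b"]]
pert  = [["a", "x", "c"], ["y", "c", "b"]]
bank = build_entropy_bank(clean, pert, k=2)
```

The per-image mask augmentation is then a set union, e.g. `M = S | {t for t, y in enumerate(tokens) if y in set(bank)}`.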
4. EGA Attack Optimization Procedure
EGA optimizes the adversarial perturbation $\delta$ by maximizing the average entropy over the selected positions $M$, under an $\ell_\infty$-norm constraint on the pixel-level perturbation:

$$\max_{\|\delta\|_\infty \le \epsilon} \; \frac{1}{|M|} \sum_{t \in M} H_t(x + \delta),$$

where $x$ is the normalized pixel input and $\epsilon$ is the standard perturbation budget. Projected gradient ascent with momentum (or the Adam optimizer) is performed over 300 steps, with periodic mask refresh. Exact pseudocode is provided in the source (He et al., 26 Dec 2025), specifying the key steps: initial greedy decoding, entropy computation, mask formation, iterative gradient updates, and final adversarial decoding.
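The update mechanics can be sketched in NumPy. Here `grad_fn` stands in for a backward pass through the VLM restricted to the masked-entropy objective, and the values `eps = 8/255`, `alpha = 2/255`, and `mu = 0.9` are illustrative assumptions rather than the paper’s reported settings; a toy concave objective replaces the model for the runnable check:

```python
import numpy as np

def pgd_ascent(x, grad_fn, eps=8/255, alpha=2/255, steps=300, mu=0.9):
    """Momentum PGD *ascent* inside an l_inf ball of radius eps around x.

    grad_fn(x_adv) returns the gradient of the objective w.r.t. the input.
    All hyperparameter defaults here are illustrative.
    """
    delta = np.zeros_like(x)
    g = np.zeros_like(x)
    for _ in range(steps):
        grad = grad_fn(np.clip(x + delta, 0.0, 1.0))
        # Momentum accumulation over l1-normalized gradients.
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        # Signed step, then projection back onto the l_inf ball.
        delta = np.clip(delta + alpha * np.sign(g), -eps, eps)
        # (In the full attack, the high-entropy mask would be refreshed
        # periodically at this point in the loop.)
    return np.clip(x + delta, 0.0, 1.0)

# Toy check: ascending a concave surrogate with maximizer at 0.5 pushes x
# upward, but never further than eps from the starting point.
x0 = np.full(4, 0.3)
x_adv = pgd_ascent(x0, grad_fn=lambda x: -(x - 0.5), eps=8/255, steps=100)
```

The projection step is what keeps the perturbation within the $\ell_\infty$ budget regardless of how many ascent steps are taken.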
5. Empirical Results and Baseline Comparisons
Quantitative evaluation on standard VLMs (Qwen2.5-VL-7B-Instruct, InternVL3.5-4B, LLaVA-1.5-7B) demonstrates that EGA achieves attack success rates (ASR) of 93.1–94.8% with ΔCIDEr scores of 0.85–0.88. Crucially, EGA induces harmful content (violence, self-harm, hate, etc.) in 37.3–47.1% of outputs, more than double the harmful rate achieved by global-entropy (MIE) baselines and greatly exceeding standard methods such as PGD, VLA, and COA. In visual question answering, similar patterns are observed, with EGA producing comparable harmful-conversion rates while retaining high ASR. Transferability metrics show EGA achieves substantially higher harmful rates on unseen target models than transferred XTA or MIE attacks.
| Model | ASR (%) | ΔCIDEr | Harmful Rate (%) |
|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 94.8 | 0.88 | 42.5 |
| InternVL3.5-4B | 93.8 | 0.86 | 37.3 |
| LLaVA-1.5-7B | 93.1 | 0.85 | 47.1 |
6. Key Insights and Implications
Empirical findings reveal that approximately 20% of tokens with the highest entropy govern the majority of generative variability. Restricting adversarial perturbations to these “decision tokens” not only preserves attack efficacy but leads to disproportionate semantic drift, enabling conversion of roughly 37–47% of benign outputs to harmful content, a vulnerability underreported by prior global-entropy attacks. High-entropy tokens recur across different VLM architectures, enabling practical cross-model transfer of attacks.
A salient implication is that robustness interventions should focus on stabilizing next-token distributions specifically at high-entropy forks, for example, via uncertainty regularization or dynamic token masking, instead of merely increasing average-case robustness. EGA thus exposes a concentrated vulnerability in modern VLM decoding and motivates reevaluation of multimodal safety mechanisms (He et al., 26 Dec 2025).
7. Practical Considerations and Implementation
EGA is implemented under a standardized $\ell_\infty$-norm constraint on normalized pixels, with optimizer options including Adam or momentum-based PGD. Perturbations are iteratively updated, and the selected entropy mask may optionally be refreshed to account for evolving generative trajectories. Parameters such as the token-selection ratio $\rho$, entropy bank size $K$, and mask refresh interval are configurable, with efficacy robust across reasonable variations. Decoding is performed greedily with typical sequence-length restrictions.
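The configurable parameters in this section can be collected into a small config object. All field names here are assumptions, and the specific defaults for `bank_size`, `refresh_every`, and `eps` are illustrative placeholders (only $\rho \approx 0.2$ and the 300-step budget are stated in the text):

```python
from dataclasses import dataclass

@dataclass
class EGAConfig:
    """Illustrative hyperparameter bundle for an EGA-style attack.

    Defaults marked 'illustrative' are assumptions, not reported values.
    """
    rho: float = 0.2          # fraction of highest-entropy positions targeted
    bank_size: int = 50       # top-K most "flippable" tokens (illustrative K)
    steps: int = 300          # gradient-ascent iterations
    refresh_every: int = 50   # mask refresh interval (illustrative)
    eps: float = 8 / 255      # l_inf budget on normalized pixels (illustrative)
    optimizer: str = "adam"   # or "momentum_pgd"

cfg = EGAConfig()
```

Keeping these knobs in one place makes it straightforward to sweep, e.g., `EGAConfig(rho=0.1)` versus `EGAConfig(rho=0.3)` when probing how robust the attack is to the selection ratio.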
A plausible implication is that the EGA framework and entropy bank concept may be adapted to broader autoregressive, multimodal, or sequence-generation settings where token-level uncertainty can be exploited for selective adversarial control.