VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands

Published 9 Oct 2025 in cs.CR and cs.AI | (2510.09699v1)

Abstract: Vision-LLMs (VLMs) have garnered significant attention for their remarkable ability to interpret and generate multimodal content. However, securing these models against jailbreak attacks continues to be a substantial challenge. Unlike text-only models, VLMs integrate additional modalities, introducing novel vulnerabilities such as image hijacking, which can manipulate the model into producing inappropriate or harmful responses. Drawing inspiration from text-based jailbreaks like the "Do Anything Now" (DAN) command, this work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. Specifically, we prepend harmful corpora with affirmative prefixes (e.g., "Sure, I can provide the guidance you need") to trick the model into responding positively to malicious queries. The adversarial image is then trained on these DAN-inspired harmful texts and transformed into the text domain to elicit malicious outputs. Extensive experiments on models such as MiniGPT-4, MiniGPT-v2, InstructBLIP, and LLaVA reveal that VisualDAN effectively bypasses the safeguards of aligned VLMs, forcing them to execute a broad range of harmful instructions that severely violate ethical standards. Our results further demonstrate that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised. These findings highlight the urgent need for robust defenses against image-based attacks and offer critical insights for future research into the alignment and security of VLMs.

Abstract PDF Upgrade to Chat

Summary

The paper introduces VisualDAN, a novel adversarial method embedding harmful 'Do Anything Now' commands in images to exploit VLMs.
The methodology involves adversarial image training, achieving high attack success rates across models like MiniGPT-4 and InstructBLIP.
Results show increased toxicity in VLM outputs, underlining the urgent need for robust multimodal defense strategies.

"VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands" (2510.09699)

Introduction

Vision-LLMs (VLMs) integrate visual and textual modalities, advancing multimodal content interpretation and generation. Despite significant alignment efforts, VLMs remain vulnerable to unique attack vectors, notably visual-driven adversarial inputs. This paper introduces an innovative adversarial strategy, VisualDAN, which employs an adversarial image embedded with "Do Anything Now" (DAN) commands to manipulate VLMs into generating harmful content. VisualDAN reveals significant vulnerabilities in multimodal systems, highlighting the urgency for robust security measures in VLM applications.

Methodology

Adversarial Image Generation

VisualDAN leverages the high dimensionality of visual inputs to embed harmful textual instructions, effectively bypassing VLM safeguards. The process involves training an adversarial image on a DAN-style harmful corpus, wherein affirmative prefixes such as "Sure, I can provide the guidance you need" are prepended to malicious instructions. This DAN-style corpus enables the crafted image to induce VLMs to generate content in response to harmful queries (Figure 1).

Figure 1: Pipeline of the proposed VisualDAN, showing the process of injecting DAN commands into adversarial images for effective manipulation of VLMs.

Attack Execution

The attack is executed by introducing a single adversarial image into VLM input, which then prompts the model to comply with a spectrum of harmful instructions. The adversarial training maximizes the probability of the VLM producing malicious outputs when presented with the modified visual input.

Experiments

Attack Success Rate

Extensive experiments demonstrated VisualDAN’s effectiveness across several VLMs, including MiniGPT-4 and InstructBLIP. The attack success rate was analyzed using metrics such as keyword-based assessments, Llama-Guard2 classifications, and GPT-4 evaluations. Results indicated VisualDAN’s capability to consistently overcome the defenses of well-aligned models, achieving high success rates without requiring model fine-tuning.

Figure 2: Attack Success Rate before and after the VisualDAN attack on the Manual-40 corpus.

Toxicity Analysis

Using tools such as Detoxify and Perspective API, the study assessed the toxicity levels of VLM-generated content under VisualDAN attacks. Increased toxicity scores confirmed the greater propensity of models to produce harmful content post-compromise. Notably, even minor toxic inputs, when combined with DAN injection, significantly amplified undesirable outputs.

Figure 3: Examples of malicious instructions and model outputs, highlighting the harmful information in red.

Discussion

VisualDAN underscores critical vulnerabilities in VLMs by exploiting the interaction of visual and linguistic modalities. This adversarial approach reveals that visual perturbations can be as dangerous as textual manipulations, suggesting that VLMs’ alignment strategies must account for multimodal threats. Current defenses remain insufficient, necessitating further development of resilient safeguarding techniques against such adversarial attacks.

Figure 4: Effectiveness of DAN Injection, illustrating its role in maximizing the attack impact of adversarial images.

Conclusion

The exploration of VisualDAN provides crucial insights into the adversarial susceptibilities of VLMs. This study emphasizes the need for advanced and multimodal-aware strategies to improve model robustness and align AI systems with ethical standards. Future work should focus on enhancing transparency within VLM architectures and developing comprehensive mitigation approaches to safeguard against diverse attack vectors. Addressing these vulnerabilities is vital for maintaining the integrity and trustworthiness of AI systems in real-world applications.

Markdown