- The paper presents a novel probabilistic framework (VERA-V) that formulates jailbreaking VLM outputs using variational inference.
- It integrates typographic, diffusion-based, and distractor techniques to generate multifaceted adversarial prompts.
- Experimental results demonstrate up to 53.75% higher attack success rates and robust transferability across different models.
"VERA-V: Variational Inference Framework for Jailbreaking Vision-LLMs" (2510.17759)
Overview
Vision-LLMs (VLMs) extend traditional LLMs by incorporating visual reasoning, which significantly advances applications such as visual question answering and autonomous agents. However, this integration also opens VLMs to specific vulnerabilities. The paper "VERA-V: Variational Inference Framework for Jailbreaking Vision-LLMs" introduces a novel approach using variational inference to identify and exploit these vulnerabilities by generating adversarial text-image prompts. This framework, known as VERA-V, frames multimodal jailbreak discovery as learning a joint posterior distribution, revealing vulnerabilities more thoroughly and systematically than existing methods.
Probabilistic Framework for Jailbreaking
VERA-V reformulates the challenge of generating adversarial prompts for VLMs into a probabilistic problem. It employs a lightweight attacker model to approximate a distribution over text-image pairs, which allows for the efficient generation of diverse jailbreaks. The method integrates three core strategies for attack: typography-based prompts, diffusion-based image synthesis, and structured distractors.
- Typography-based Prompts: Utilizes fonts and styles to encode text within images, evading text filters.
- Diffusion-based Image Synthesis: Uses adversarial imagery created through diffusion models to embed harmful signals implicitly.
- Structured Distractors: Introduces unrelated images as distractors to fragment VLM attention and obscure malicious intent.
Implementation and Methodology
The implementation of VERA-V involves training an attacker model using reinforcement learning principles to optimize for attack success. The framework evaluates the harmfulness of VLM outputs using a judge function, which provides feedback to iteratively enhance the adversarial prompt generation.
Algorithmic Steps:
- Sample Text-Image Prompts: Use the attacker model to generate initial candidate prompts.
- Generate Composite Images: Combine typographic and diffusion-generated elements with distractors.
- Query Target VLM: Assess attack success based on VLM responses and refine prompts accordingly.
- Optimization: Employ a policy gradient method to adjust the attacker model based on feedback from the judge function.
Experimental Results
VERA-V demonstrates state-of-the-art performance on benchmarks such as HarmBench and HADES, achieving significant improvements in attack success rates over existing red-teaming methods. Notably, it achieves up to 53.75% higher success on certain models like GPT-4o, illustrating its robust attack capabilities.
Figure 1: Single-attack vs. feedback-driven multimodal jailbreaks. CS-DJ decomposes harmful queries into typographic sub-images and distractors, while VERA-V refines joint text-image prompts through optimization.
Figure 2: Diversity evaluation of VERA-V prompts showing increased diversity and deviation from template prompts, enhancing its robustness and adaptability.
Transferability and Stealth
VERA-V prompts exhibit strong transferability across different VLM architectures, suggesting that the vulnerabilities it uncovers are widely applicable. Moreover, the approach minimizes toxicity detection rates, indicating that its adversarial prompts are not only effective but also stealthy, evading typical safety mechanisms more successfully than competing methods.
Conclusion
VERA-V offers a powerful, probabilistic framework for generating adaptive and effective multimodal jailbreaks against VLMs. Its ability to model and explore the distribution of adversarial prompts yields more comprehensive evaluations of VLM safety and demonstrates significant practical implications for future AI safety and security research. The framework’s efficiency and adaptability make it a compelling tool for ongoing and future assessments of VLM robustness.