VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

Published 20 Oct 2025 in cs.CR, cs.CL, cs.CV, cs.LG, and stat.ML | (2510.17759v1)

Abstract: Vision-LLMs (VLMs) extend LLMs with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.

Abstract PDF Upgrade to Chat

Summary

The paper presents a novel probabilistic framework (VERA-V) that formulates jailbreaking VLM outputs using variational inference.
It integrates typographic, diffusion-based, and distractor techniques to generate multifaceted adversarial prompts.
Experimental results demonstrate up to 53.75% higher attack success rates and robust transferability across different models.

"VERA-V: Variational Inference Framework for Jailbreaking Vision-LLMs" (2510.17759)

Overview

Vision-LLMs (VLMs) extend traditional LLMs by incorporating visual reasoning, which significantly advances applications such as visual question answering and autonomous agents. However, this integration also opens VLMs to specific vulnerabilities. The paper "VERA-V: Variational Inference Framework for Jailbreaking Vision-LLMs" introduces a novel approach using variational inference to identify and exploit these vulnerabilities by generating adversarial text-image prompts. This framework, known as VERA-V, frames multimodal jailbreak discovery as learning a joint posterior distribution, revealing vulnerabilities more thoroughly and systematically than existing methods.

Probabilistic Framework for Jailbreaking

VERA-V reformulates the challenge of generating adversarial prompts for VLMs into a probabilistic problem. It employs a lightweight attacker model to approximate a distribution over text-image pairs, which allows for the efficient generation of diverse jailbreaks. The method integrates three core strategies for attack: typography-based prompts, diffusion-based image synthesis, and structured distractors.

Typography-based Prompts: Utilizes fonts and styles to encode text within images, evading text filters.
Diffusion-based Image Synthesis: Uses adversarial imagery created through diffusion models to embed harmful signals implicitly.
Structured Distractors: Introduces unrelated images as distractors to fragment VLM attention and obscure malicious intent.

Implementation and Methodology

The implementation of VERA-V involves training an attacker model using reinforcement learning principles to optimize for attack success. The framework evaluates the harmfulness of VLM outputs using a judge function, which provides feedback to iteratively enhance the adversarial prompt generation.

Algorithmic Steps:

Sample Text-Image Prompts: Use the attacker model to generate initial candidate prompts.
Generate Composite Images: Combine typographic and diffusion-generated elements with distractors.
Query Target VLM: Assess attack success based on VLM responses and refine prompts accordingly.
Optimization: Employ a policy gradient method to adjust the attacker model based on feedback from the judge function.

Experimental Results

VERA-V demonstrates state-of-the-art performance on benchmarks such as HarmBench and HADES, achieving significant improvements in attack success rates over existing red-teaming methods. Notably, it achieves up to 53.75% higher success on certain models like GPT-4o, illustrating its robust attack capabilities.

Figure 1: Single-attack vs. feedback-driven multimodal jailbreaks. CS-DJ decomposes harmful queries into typographic sub-images and distractors, while VERA-V refines joint text-image prompts through optimization.

Figure 2: Diversity evaluation of VERA-V prompts showing increased diversity and deviation from template prompts, enhancing its robustness and adaptability.

Transferability and Stealth

VERA-V prompts exhibit strong transferability across different VLM architectures, suggesting that the vulnerabilities it uncovers are widely applicable. Moreover, the approach minimizes toxicity detection rates, indicating that its adversarial prompts are not only effective but also stealthy, evading typical safety mechanisms more successfully than competing methods.

Conclusion

VERA-V offers a powerful, probabilistic framework for generating adaptive and effective multimodal jailbreaks against VLMs. Its ability to model and explore the distribution of adversarial prompts yields more comprehensive evaluations of VLM safety and demonstrates significant practical implications for future AI safety and security research. The framework’s efficiency and adaptability make it a compelling tool for ongoing and future assessments of VLM robustness.

Markdown