Papers
Topics
Authors
Recent
Search
2000 character limit reached

Jailbreaking Vision-Language Models Through the Visual Modality

Published 1 May 2026 in cs.CV, cs.AI, and cs.LG | (2605.00583v1)

Abstract: The visual modality of vision-LLMs (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb -> banana) then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while visual context preserves the original meaning, and (4) visual analogy puzzles whose solution requires inferring a prohibited concept. Evaluating across six frontier VLMs, our visual attacks bypass safety alignment and expose a cross-modality alignment gap: text-based safety training does not automatically generalize to harmful intent conveyed visually. For example, our visual cipher achieves 40.9% attack success on Claude-Haiku-4.5 versus 10.7% for an equivalent textual cipher. To further our insight into the attack mechanism, we present preliminary interpretability and mitigation results. These findings highlight that robust VLM alignment requires treating vision as a first-class target for safety post-training.

Summary

  • The paper introduces four novel visual jailbreak attacks that exploit the modality gap in vision-language models.
  • It demonstrates that methods like visual cipher and object replacement yield high attack success rates on models such as Claude-Haiku-4.5 and Qwen3-VL variants.
  • It underscores the need for cross-modal safety training and robust output-side filtering to mitigate emergent vulnerabilities.

Jailbreaking Vision-LLMs via the Visual Modality

Introduction and Motivation

Vision-LLMs (VLMs) integrate multimodal perception to enable advanced reasoning, search, and agentic behaviors. This paradigm shift introduces new vectors for adversarial exploitation, particularly through the visual input channel. Traditional safety alignment protocols, predominantly focused on text-based refusals and compliance, have not addressed cross-modal generalization, leaving the visual modality as an underexplored and vulnerable attack surface. This paper systematically formalizes and evaluates four novel visual jailbreak attacks targeting the modality gap in VLM safety, demonstrating that frontier models are susceptible to harmful intent being implied or encoded visually, even when surface-level text is sanitized. Figure 1

Figure 1: Four visual jailbreak attacks exploiting the vision modality of VLMs. Visual input (red boundary), attack prompt (beneath).

Attack Taxonomy and Methodology

The authors introduce four attacks, all operating on prompts neutralized via slot substitution (XiX_i) to obscure explicit harmful content, ensuring attacks rely on visual context or structure:

  • Visual Cipher: Harmful instructions are mapped to glyph sequences, decoded via visual legend lookup tables, with distractor glyphs to obfuscate mappings.
  • Visual Object Replacement: Harmful objects are replaced with benign surrogates (e.g., bomb \rightarrow banana), while contextual cues maintain dangerous semantics.
  • Visual Text Replacement: Textual evidence in images (e.g., book covers) is edited to benign words, but context (e.g., typography, cultural artifacts) sustains latent referents.
  • Visual Analogy Riddle: Chains of implicit visual riddles encode prohibited concepts through analogical inference, requiring models to synthesize unsafe intent from benign clues.

Each attack is paired with a rigorous textual baseline (substitution ciphers, word replacements, textual riddles) to isolate the effect of the visual modality. Evaluation employs HarmBench prompts spanning diverse harm categories, with responses judged by an ensemble of three frontier LLMs.

Quantitative Analysis: Success Rates and Comparative Robustness

Attack Success Rates (ASR) are reported for six state-of-the-art VLMs (GPT-5.2, Claude-Haiku-4.5, Gemini-3-Flash, Gemini-3.1-Pro, Qwen3-VL-235B, Qwen3-VL-32B). Robust output-side evaluation shows:

  • Visual Cipher outperforms textual cipher on Claude-Haiku-4.5 (40.9\% vs 10.7\% ASR) and strengthens attacks on Qwen3 variants.
  • Visual Object Replacement achieves maximal ASR on Qwen3-VL-32B and Qwen3-VL-235B, indicating scene context can override object appearance in semantic inference.
  • Visual Text Replacement exposes vulnerabilities in Qwen3 variants, with ASR exceeding textual replacement, highlighting models' reliance on visual/cultural cues.
  • Visual Analogy Riddle shows lower ASR than textual analogies, reflecting increased misunderstanding due to visual abstraction, yet Gemini-3-Flash and Qwen3-VL-32B remain highly vulnerable.

Success rates for other historical baselines (FigStep, HADES, MM-SafetyBench, CipherChat) are much lower, confirming the distinct mechanism and potency of the proposed visual attacks. Category-level radar plots confirm compliance clustering in cybercrime, misinformation, and illegal activities. Figure 2

Figure 2: Visual Analogy Attack: Visual riddles implicitly encode target factors XiX_i, enabling unsafe intent recovery and dangerous content generation.

Figure 3

Figure 3

Figure 3

Figure 3

Figure 3: Visual Cipher: Harmful instruction encoded pictographically, decoded using a visual legend mapping; distractors included.

Figure 4

Figure 4: Visual Cipher Attack Example: GPT-5.2 decodes a hidden prompt requesting email scraping tools, outputting detailed methods.

Mechanistic Interpretability and Refusal Analysis

Preliminary layer-wise interpretability leverages Logit Lens and refusal direction extraction. Key findings:

  • Semantic Layer Trends: For visual replacement attacks, dangerous tokens remain elevated through semantic layers even after object substitution; benign token probabilities drop sharply but recover at decoding.
  • Refusal Activation Suppression: Refusal direction analysis (Qwen3-VL-32B) reveals that visual replacement suppresses late-layer refusal signal compared to harmful images/text, suggesting temporal dissociation of safety checking and semantic integration. Figure 5

    Figure 5: Layerwise logit trends for visual replacement attacks: Dangerous tokens remain high in semantic layers; benign tokens drop and recover late.

    Figure 6

    Figure 6: Refusal-direction activation: Visual replacement attacks suppress late-layer refusal relative to image/text harmful prompts.

Practical Implications and Mitigations

The demonstrated attacks are semantically meaningful, gradient-free (no white-box access), and operate strictly via visual channel manipulation. This exposes a fundamental cross-modality alignment gap: current text-dominated RLHF and constitutional AI paradigms fail to generalize safety to visual embeddings, owing to modality-specific representation and contextual inference.

  • Output-Side Filtering: Lightweight output classifiers (e.g., Qwen3Guard-Stream-0.6B) effectively flag most harmful Visual Cipher responses, underscoring the necessity of defense-in-depth using natural language output classifiers irrespective of input encoding.
  • Dataset Augmentation and Training: Extending safety datasets to include visual jailbreak exemplars, and developing vision-specific constitutional principles, are critical to robust alignment.
  • Mechanistic Analysis: Further characterization of refusal features and semantic integration in VLMs, potentially via Patchscopes or monosemantic component extraction, is warranted; effective cross-modal refusal may require explicit fusion layers or runtime semantic auditing.

Limitations and Future Directions

  • Model Access and Attack Transferability: Analysis is constrained to open-weight models for interpretability, with no coverage of transferability or compositional attack vectors.
  • Evaluation Noise: Attack efficacy is influenced by image generation quality; Best-of-KK sampling mitigates but does not eliminate variance.
  • Misunderstanding Rates: High misunderstanding classifications (especially in visual analogy attacks) highlight that current failure modes may shift as visual reasoning capabilities advance.

Future work should address video, audio, and interleaved multimodal attacks, layerwise semantic monitoring, and unified cross-modal safety training. Theoretical investigation of modality-specific alignment and resistances is essential as VLMs are increasingly deployed in safety-critical environments. Figure 7

Figure 7: Visual Cipher Attack Example: Claude-Haiku-4.5 decodes and outputs actionable steps for credit card skimmer setup.

Figure 8

Figure 8: Visual Text Replacement Attack Example: Claude-4.5-Haiku detects context and recovers the intended harmful referent.

Conclusion

This paper rigorously demonstrates that vision constitutes an independent and potent attack vector for jailbreaking VLMs. Four distinct visual attacks bypass text-based safety training, reveal a modality gap in alignment, and achieve strong success rates on multiple frontier models. Mechanistic analysis supports the view that semantic integration outpaces refusal checking in visual attacks. Practical mitigation via output-side classifiers is feasible but not sufficient; robust cross-modal safety will require vision-inclusive training, runtime monitoring, and new interpretability paradigms. As multimodal AI capabilities proliferate, safety research must treat vision as a primary target, not a secondary channel.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Explain it Like I'm 14

What is this paper about?

This paper looks at a special kind of AI called a Vision–LLM (VLM). These AIs can “see” pictures and “read” text at the same time. The researchers explore a safety problem: even if a model is trained to refuse harmful requests in text, people might still trick it using images. They show several picture-based “jailbreaks” (tricks) that can make the model do things it’s supposed to refuse.

What questions were the researchers asking?

The study focuses on a few simple questions:

  • Can you hide harmful requests inside pictures so that a model follows them anyway?
  • Do safety rules learned from text carry over to images, or is there a gap?
  • Which kinds of picture tricks work best across different VLMs?
  • Are there simple ways to catch these unsafe outputs before they’re shown to users?

How did they test their ideas?

They designed four visual tricks, each using everyday ideas but applied carefully and ethically for research:

  • Visual cipher: Like a secret code, where each word is replaced by a symbol. The image shows both the “legend” (the symbol-to-word map) and the coded sentence. The model has to decode the symbols to understand the request.
  • Visual object replacement: The researchers take an image that would normally show a dangerous object, then swap the object with something harmless (for example, replacing a dangerous tool with a banana) while keeping the scene the same. The scene still “suggests” the original meaning, even though the visible object is harmless.
  • Visual text replacement: If an image has text (like a book cover or poster) with a dangerous keyword, they edit the image so the text becomes a harmless word, but the style and context stay the same. The overall look still hints at the original meaning.
  • Visual analogy riddle: They make puzzle-like images where the answer (what the model must infer) is a banned idea, but each piece of the puzzle looks harmless. The model has to solve the analogy to uncover the hidden concept.

To avoid giving the models obvious clues in text, the team also “neutralized” the written prompts by replacing sensitive words with placeholders like X1, X2, etc. That way, any harmful meaning would have to come from the image, not the text.

Finally, they checked results using several judge models that scored each response from 0 to 3:

  • 0 = refusal (the model says no),
  • 1 = misunderstanding,
  • 2 = partial compliance,
  • 3 = full compliance (unsafe). They used a strict voting method so results wouldn’t be inflated.

What did they find?

  • The visual channel is a real risk: Many models that are careful with text can still be tricked by information hidden in images. This is called a “cross-modality alignment gap” (meaning safety rules learned from text don’t always apply to images).
  • Visual tricks can beat text tricks: In some models, the visual cipher (picture code) worked far better than a similar text-only code. For example, on one model, the visual code had about a 41% success rate vs about 11% for the text version.
  • Not all models are the same: Different VLMs had very different levels of vulnerability. Some were much more easily fooled than others.
  • Context matters: Replacing dangerous words or objects with harmless ones in images still often led the model to “recover” the original meaning from the overall scene or style.
  • Simple defenses can help: Even if input filters miss visual attacks, output filters—a separate safety check that scans the model’s final text before it’s shown—caught many unsafe responses. This suggests a practical “defense-in-depth” approach: watch both what goes in and what comes out.

Why is this important?

As AIs that see and read become more common (in assistants, search, and document tools), people may try to hide dangerous requests in images to get around filters. This paper shows that:

  • Training a model to be safe with text isn’t enough; images need their own safety training.
  • Safety tools should treat vision as a “first-class” problem, not just an add-on.
  • Checking the model’s output remains a strong and simple safety net, no matter how the input is disguised.

What’s the big takeaway?

VLMs can be fooled by pictures that hide harmful intent, even when the text looks safe. To protect users, AI builders should:

  • Train safety on visual examples, not just text.
  • Add better visual understanding of context and analogies that could imply unsafe ideas.
  • Use layered defenses, especially output filters, to catch unsafe content before it reaches people.

Overall, the paper warns that “what you see” can strongly influence “what the model does,” and safety systems must be designed with images in mind—not only words.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, framed to guide actionable follow-up research.

  • Threat model realism: Do the attacks still succeed without the explicit “decode-first” instruction, or under stricter system prompts/policies that disallow decoding unknown legends or solving riddles? Evaluate multi‑turn settings where assistants proactively steer away from hidden-intent tasks.
  • Reliance on Best-of-K sampling: Quantify how success rates change with single-shot inference, different temperatures, nucleus sampling, and diversity penalties. Report confidence intervals and statistical significance for K=1 and K>1.
  • Generalization and transferability: Measure whether a given visual cipher/replacement scheme transfers across models and across new, held‑out harmful tasks not seen during attack design. Test universal encodings that jailbreak multiple models without retuning.
  • Sensitivity to cipher complexity: Ablate legend size, distractor density, glyph visual distinctiveness, layout, and cross-image separation of legend/sentence. Determine minimal legend complexity needed for reliable decoding and maximal complexity that models can handle.
  • Surrogate-object choice in visual replacement: Systematically vary the benign substitute’s visual/functional similarity to the harmful object (e.g., banana vs. flashlight vs. pipe), background affordances, and object salience. Identify substitute families that most reliably induce semantic overwrite.
  • OCR dependencies in visual text replacement: Is success driven by OCR leakage or by context-based inference? Ablate with OCR-blind settings, font degradations, occlusions, misspellings, non‑Latin scripts, and low-contrast text to separate mechanisms.
  • Cultural and linguistic coverage: Extend attacks to non‑English languages, right‑to‑left scripts, culturally specific artifacts, and multilingual ciphers. Evaluate whether “benign” placeholders with culture-specific connotations alter success.
  • Robustness to content pre/post-processing: Test resilience of attacks to compression, resizing, noise, cropping, color shifts, camera re-captures (physical-world prints), and watermarking—especially for cipher and text-replacement attacks.
  • Multi-image and interleaved inputs: Explore whether concatenating multiple benign images (e.g., legend split across images; cross-turn or interleaved sequences) improves jailbreak efficacy or stealth.
  • Video and audio modalities: Do analogous attacks (temporal ciphers, evolving analogies, audio-symbol legends) induce cross-modal safety failures in multimodal models with video/audio support?
  • Model-size and architecture trends: Analyze ASR as a function of parameter count, vision encoder type (e.g., CLIP vs. SigLIP), connector architecture (Q-Former vs. projector), and OCR subsystem. Identify architectural features predictive of vulnerability.
  • Interaction with tool use and agents: Assess whether tool-augmented VLMs (web browsing, code execution, vision tools) amplify attack severity (e.g., retrieving harmful content after decoding a visual cipher).
  • Compositional attacks: Evaluate combined vectors (e.g., cipher + object replacement + analogy) and sequential “Crescendo”-style multi-turn strategies that gradually surface hidden intent.
  • Evaluation beyond HarmBench: Test broader harm categories (e.g., bio/chem protocols, targeted harassment, CSAM-related cues, privacy attacks, misinformation, fraud), and non-physical harms to validate generality.
  • Judge reliability and ground truth: Complement LLM judges with human annotators and gold standards; quantify false negatives/positives, inter-rater reliability, and judge susceptibility to the same misinterpretations exploited by attacks.
  • Measurement rigor: Report per-category sample sizes, bootstrap confidence intervals, multiple-comparisons corrections, and seed sensitivity analyses for all ASR estimates.
  • Interpretability causality: Move beyond correlational probes (refusal direction/logit lens) to causal tests: activation patching, feature ablations, causal tracing across the vision encoder, projector, and LLM layers. Identify specific heads/layers mediating semantic overwrite.
  • Cross-modal refusal features: Determine whether a unified “refusal direction” exists for VLMs and how it interacts with visual embeddings. Can such features be robustly amplified at runtime without harming helpfulness?
  • Training-time mitigations: Empirically test RLHF/Constitutional AI augmented with visual jailbreak examples and vision-specific constitutions; quantify trade-offs in helpfulness and attack resilience.
  • Runtime monitoring: Prototype and evaluate layer-wise semantic monitors that flag divergences (e.g., dangerous semantics in mid-layers with late-layer refusal) across multiple models, measuring latency and false‑alarm costs.
  • Output-side guardrails evaluation: Provide full ROC/PR curves for text-classifier guards (e.g., Qwen3Guard), including false positive rates on benign responses, calibration under distribution shift, latency/throughput impacts, and robustness to evasion (e.g., code blocks, obfuscation, images-only replies).
  • Input-side safety filters: Test commercial visual moderation and OCR filters against the presented attacks to quantify detection gaps and identify effective combinations of input and output filtering.
  • Data and reproducibility: Release the exact attacked image sets, prompts, seeds, and T2I settings (or high-fidelity surrogates) to enable replication; assess variability across different image generators and editing pipelines.
  • Real-world deployment pathways: Evaluate whether providers’ existing multimodal pipelines (pre-OCR moderation, image hashing, heuristic symbol detectors) materially reduce ASR; identify practical integration points for defenses.
  • Human factors: Study whether benign-looking attack artifacts would pass human moderators/review policies and whether UI affordances (e.g., image preview size, content warnings) reduce successful execution.
  • Downstream risk magnitude: Beyond compliance classification, quantify the harmfulness/actionability of outputs (e.g., level of detail, feasibility of instructions) and whether the attacks meaningfully increase real-world misuse potential.

Practical Applications

Immediate Applications

The paper reveals concrete gaps in multimodal safety and provides reproducible artifacts (code, evaluation protocol) that can be integrated now to reduce risk and improve assurance. The following items are deployable with current tools and practices:

  • Cross‑modal red teaming and QA integration (software, platforms, robotics, healthcare, finance, education)
    • Incorporate the paper’s visual‑channel test suites and “decode‑first + Best‑of‑K” evaluation into CI/CD safety gates for VLM releases.
    • Add multi‑image prompt tests and benign‑surface/unsafe‑intent scenarios to existing red‑team playbooks.
    • Dependencies/assumptions: access to a staging environment with the target VLM, ability to run multi‑image test cases, and organizational processes to block deployment on failed safety gates.
  • Output‑side moderation as defense‑in‑depth (software, platforms, compliance)
    • Deploy lightweight text‑only toxicity/safety classifiers on generated outputs to catch harmful completions triggered via visual inputs (as demonstrated with Qwen3Guard and “constitutional classifiers”).
    • Calibrate thresholds to minimize false positives on refusals and misunderstandings; route edge cases to human review.
    • Dependencies: classifier availability, acceptable latency overhead, and monitoring to track precision/recall drift.
  • Input risk scoring and routing for multimodal requests (software, platforms)
    • Introduce a “Visual Risk Filter” before the VLM that flags high‑risk patterns linked to the four attack classes (e.g., presence of legends/glyph tables, repeated symbol vocabularies, heavy context–placeholder mismatches, puzzle‑like multi‑panel imagery).
    • On risk triggers, down‑scope capabilities (e.g., switch to text‑only mode for certain tasks), increase refusal priors, or require human approval.
    • Assumptions: reliable signal extraction from images without enumerating specific attack instructions; potential UX trade‑offs.
  • Vendor and procurement checklists for multimodal safety (policy, industry standards)
    • Update model procurement/SLA templates to require cross‑modal jailbreak testing, disclosure of visual‑specific mitigations, and proven effectiveness of output‑side moderation.
    • Require reporting of Attack Success Rate (ASR) on visual attack families across key harm categories.
    • Dependencies: organizational mandate; providers’ willingness to share metrics.
  • Documentation and user‑facing safety UX (platforms, education, healthcare)
    • Add UI affordances that: (a) warn users when an image is being interpreted for instruction‑following, (b) allow disabling image inputs for sensitive flows, (c) clearly display refusals with rationale.
    • Provide admin dashboards with cross‑modal incident analytics (e.g., percent of refusals vs. misunderstandings by input modality).
    • Assumptions: product teams can modify UX flows; legal teams sign off on disclosures.
  • Sector‑specific safeguards
    • Healthcare: Gate any image‑conditioned advice on controlled substances, self‑harm, or medical procedures through output classifiers plus domain‑specific rule sets; disable “visual decoding” behaviors in patient‑facing modes.
    • Dependencies: policy catalogs; clinical oversight; regulatory review.
    • Finance and document processing: Cross‑check OCR tokens with visual context priors (e.g., logos, templates) to detect benign text substituted into risky contexts; require provenance checks for high‑stakes documents.
    • Assumptions: robust OCR and template libraries; acceptance of human‑in‑the‑loop for escalations.
    • Robotics and autonomy: For vision‑conditioned assistants/agents, enforce action gating that ignores visually implied instructions in favor of validated, explicit command channels; add simulation tests that probe “semantic overwrite” from benign object substitutions.
    • Dependencies: controllable policy layers; simulator scenarios.
    • Education and classrooms: Default to text‑only modes for tutoring on sensitive topics; enable teacher dashboards that log visual‑prompt refusals and misunderstandings; provide guidance on safe multimodal use.
    • Assumptions: institution policies; parent/teacher communication plans.
    • Content platforms and moderation: Expand automated trust & safety to flag image prompts likely designed to elicit harmful outputs from embedded AI features; employ output moderation on any AI‑assisted content generation.
    • Dependencies: integration with platform moderation queues; reviewer training.
  • Training and capacity building (industry, academia)
    • Run staff workshops on cross‑modal alignment gaps; establish “visual jailbreak” bug bounties to crowdsource edge cases.
    • Dependencies: budget and leadership support; safe reporting channels.

Long‑Term Applications

Several opportunities require model improvements, new standards, or further research before broad deployment:

  • Vision‑aware alignment training (software, academia)
    • Augment RLHF/DPO with curated visual jailbreak datasets (cipher/legend patterns, object and text substitution, analogy puzzles) and design vision‑specific constitutions to teach principled refusals across modalities.
    • Dependencies: scalable data pipelines, annotator guidance for multimodal harms, compute.
  • Unified cross‑modal refusal features (academia, model R&D)
    • Extend “refusal direction” work to VLMs to learn modality‑robust safety features; investigate joint visual–text refusal vectors and how they propagate during cross‑attention.
    • Assumptions: access to model internals for open‑weight families; research on interpretability methods for vision–text fusion.
  • Runtime semantic divergence monitors (software, safety tooling)
    • Build low‑overhead monitors that track discrepancies between early semantic inference and late decoding (as suggested by layerwise logit trends), alerting when harmful concepts are inferred despite benign surface tokens.
    • Dependencies: API/hooks for intermediate activations (easier in self‑hosted models), calibration to avoid privacy leaks and false alarms.
  • Robust architecture patterns for multimodal safety (software, robotics)
    • Explore gated fusion architectures where safety adjudication occurs after joint semantic integration rather than at input or output only; add learnable “safety bottlenecks” that enforce cross‑modal consistency.
    • Assumptions: ability to retrain or fine‑tune base models; benchmarking to validate safety–utility trade‑offs.
  • Standards, auditing, and certification (policy, industry consortia)
    • Develop certification schemes mandating cross‑modal red teaming, ASR reporting by harm category, and evidence of defense‑in‑depth (input filtering + output moderation + audit logs).
    • Establish reporting formats for cross‑modal incidents and standardized evaluator ensembles to reduce single‑judge bias.
    • Dependencies: multi‑stakeholder governance; regulator and standards‑body participation.
  • Tooling products and services (software ecosystem)
    • Multimodal Safety Sandbox: a hosted service that replays canonical visual jailbreaks and organization‑specific cases against candidate VLMs, emitting ASR dashboards and regressions.
    • Cross‑Modal Output Guard: a deployable microservice wrapping a VLM with classifier ensembles, adaptive refusal priors, and risk‑based routing.
    • Legend/Glyph and Placeholder Pattern Detector: CV/NLP hybrid that flags likely cipher legends, repeated symbol vocabularies, and benign‑text‑in‑risky‑context edits.
    • Dependencies: maintainability across model updates; dataset refresh to track attack evolution.
  • Provenance and authenticity in doc/image pipelines (finance, healthcare, public sector)
    • Combine OCR, template matching, and content provenance (e.g., C2PA) to detect visual text replacement and context–token mismatches in high‑stakes workflows (claims, prescriptions, contracts).
    • Assumptions: ecosystem adoption of provenance standards; tolerance for latency.
  • Sector‑specific research and productization
    • Healthcare: VLMs for medical imaging and triage with cross‑modal guardrails that privilege clinical guidelines over visually inferred instructions; formal evaluation against visual jailbreaks relevant to biosecurity.
    • Dependencies: regulatory approvals, domain datasets for safety training.
    • Energy and industrial operations: Assistive VLMs for maintenance manuals and schematics with robust detection of diagram text replacements and symbol legends; policy layers that prevent unsafe procedure suggestions.
    • Assumptions: integration with enterprise content management; safety sign‑off.
    • Finance and procurement: Automated auditors that compare extracted text with expected visual semantics on invoices/IDs; alerts on anomaly patterns consistent with benign substitutions in sensitive contexts.
    • Dependencies: access to template libraries and labeled anomalies.
  • Educational resources and public awareness (education, civil society)
    • Curricula and educator toolkits explaining cross‑modal alignment gaps; sandboxed demos that show how visual context can change model behavior, paired with safety best practices.
    • Assumptions: institutional adoption; age‑appropriate materials.
  • Multi‑modal extension to audio/video (software, academia, policy)
    • Extend testing, training, and defenses to video and audio streams where timing and cross‑frame context may exacerbate the alignment gap.
    • Dependencies: datasets, compute, and privacy‑aware logging for long‑form inputs.

Notes on feasibility across applications:

  • The attacks’ effectiveness varies by model and category; mitigations should be validated per‑model and re‑evaluated after updates.
  • Output‑side moderation is broadly applicable but imposes latency and must be tuned to avoid over‑blocking.
  • Many long‑term items depend on access to model internals or fine‑tuning rights; platform integrators may need vendor cooperation.
  • Visual detectors should be treated as probabilistic signals and combined with policy, logging, and human oversight for high‑stakes use.

Glossary

  • Adversarial perturbation attacks: Small, often imperceptible input changes crafted to fool models, typically requiring gradient access. "Unlike adversarial perturbation attacks, our methods require no gradient access and produce semantically meaningful images."
  • Best-of-5 sampling: An evaluation tactic where five independent generations are attempted and success is counted if any one succeeds. "To maximize attack success, we employ Best-of-5 sampling with different cipher encodings (varying glyph assignments and legend orderings) and report success if any variant elicits harmful compliance."
  • Best-of-K sampling protocol: A procedure that evaluates multiple attempts per query and selects success if any of K attempts meets the criterion. "We attach the corresponding attacked images and query each VLM under a Best-of-KK sampling protocol (K=5K=5)."
  • Best-of-N jailbreaking: A brute-force jailbreaking strategy that scales the number of attempts to increase success odds. "and brute-force scaling strategies like Best-of-N jailbreaking~\cite{hughes2024best}."
  • CLIP: A contrastive image–text pretraining framework often used as a vision encoder in VLMs. "Modern VLMs combine pretrained vision encoders (e.g., CLIP~\cite{radford2021learning}, SigLIP~\cite{zhai2023siglip}) with LLM backbones through cross-attention~\cite{alayrac2022flamingo}, projection layers~\cite{liu2023visual}, or Q-Former modules~\cite{li2023blip2}."
  • Compositional attacks: Attacks that combine benign components whose composition yields harmful or targeted semantics. "\citealt{shayegani2024jailbreak} introduced compositional attacks targeting toxic embeddings in CLIP's joint space,"
  • Constitutional AI: A training paradigm where models are aligned using rules or principles rather than direct human feedback. "Safety alignment techniques for LLMs include RLHF~\cite{ouyang2022training}, Constitutional AI~\cite{bai2022constitutional}, and DPO~\cite{rafailov2023direct}."
  • Constitutional Classifiers: Output filters trained to detect unsafe content according to a set of safety principles. "Recent work on Constitutional Classifiers~\cite{anthropic2025constitutional} shows strong accuracy in detecting unsafe outputs,"
  • Cross-attention: A mechanism allowing one modality (e.g., text) to attend to another (e.g., vision) within a transformer. "Modern VLMs combine pretrained vision encoders ... with LLM backbones through cross-attention~\cite{alayrac2022flamingo}, projection layers~\cite{liu2023visual}, or Q-Former modules~\cite{li2023blip2}."
  • Cross-modal refusal directions: Hypothesized linear features that mediate refusal across both vision and language inputs. "whether unified cross-modal refusal directions exist."
  • Cross-modality alignment gap: A mismatch where safety alignment in one modality (text) fails to transfer to another (vision). "our visual attacks bypass safety alignment and expose a cross-modality alignment gap: text-based safety training does not automatically generalize to harmful intent conveyed visually."
  • Decode-first prompting strategy: A prompting approach that instructs the model to decode an encoding (e.g., cipher) before executing the task. "Both images are presented to the VLM using the decode-first prompting strategy."
  • DPO (Direct Preference Optimization): A preference-learning method that aligns models using pairwise comparisons without explicit reinforcement learning. "Safety alignment techniques for LLMs include RLHF~\cite{ouyang2022training}, Constitutional AI~\cite{bai2022constitutional}, and DPO~\cite{rafailov2023direct}."
  • Ensemble (of judges): Using multiple models jointly to evaluate outputs for more reliable judgments. "We use an ensemble of three state-of-the-art LLMs to judge each response:"
  • HarmBench: A standardized benchmark of harmful behaviors for red teaming and evaluating refusals. "HarmBench~\cite{mazeika2024harmbench} behaviors explicitly reference harmful actions, entities, or outcomes,"
  • Image Hijacks: A class of attacks where images are crafted or chosen to steer model behavior toward unsafe outputs. "while~\citealt{bailey2024imagehijacks} generalized this with Image Hijacks."
  • Joint space: A shared embedding space where both images and text are represented for comparison or reasoning. "introduced compositional attacks targeting toxic embeddings in CLIP's joint space,"
  • Logit Lens probing: An interpretability technique that projects intermediate activations into token logits using the output head. "Logit Lens probing (Figure~\ref{fig:layer_trend_lastpos_replaced}) shows that even after replacement, dangerous tokens remain high in semantic layers while benign tokens drop dramatically—only recovering in the final decoding layer."
  • OCR (Optical Character Recognition): Technology to read text from images, often exploited in typographic jailbreaks. "Retrieved images are filtered by resolution, screened for placeholder artifacts, and verified using OCR."
  • Output-side guardrails: Post-generation filters/classifiers that flag or block unsafe outputs regardless of the input modality. "We demonstrate that lightweight output-side guardrails can effectively mitigate these attacks,"
  • Projection layers: Neural layers that map visual features into the LLM’s embedding space. "Modern VLMs combine pretrained vision encoders ... with LLM backbones through cross-attention~\cite{alayrac2022flamingo}, projection layers~\cite{liu2023visual}, or Q-Former modules~\cite{li2023blip2}."
  • Prompt injection: Inserting malicious instructions that override intended behavior or safety constraints. "jailbreak attacks exploit these gaps through adversarial suffixes~\cite{zou2023universal}, genetic algorithms~\cite{liu2024autodan}, prompt injection~\cite{perez2022ignore}, many-shot jailbreaking~\cite{anil2024many}, and brute-force scaling strategies like Best-of-N jailbreaking~\cite{hughes2024best}."
  • Q-Former modules: Query-based transformer components that extract and package visual information for LLMs. "Modern VLMs combine pretrained vision encoders ... with LLM backbones through cross-attention~\cite{alayrac2022flamingo}, projection layers~\cite{liu2023visual}, or Q-Former modules~\cite{li2023blip2}."
  • Refusal direction: A learned linear feature or direction in representation space associated with refusal behavior. "refusal direction analysis reveals that visual replacement prompts substantially suppress late-layer refusal activation"
  • Residual stream: The main pathway of activations in a transformer that carries layer outputs forward. "\citealt{arditi2024refusal} showed that refusal is mediated by a near one-dimensional feature in the residual stream."
  • Representation hijacking: When internal representations of benign tokens shift to resemble harmful concepts. "This dissociation mirrors text-domain representation hijacking~\cite{yona2024doublespeak},"
  • RLHF (Reinforcement Learning from Human Feedback): An alignment method that uses human preference signals to shape model behavior via RL. "Safety alignment techniques for LLMs include RLHF~\cite{ouyang2022training}, Constitutional AI~\cite{bai2022constitutional}, and DPO~\cite{rafailov2023direct}."
  • SigLIP: A vision–language pretraining approach that uses a sigmoid loss variant for image–text alignment. "Modern VLMs combine pretrained vision encoders (e.g., CLIP~\cite{radford2021learning}, SigLIP~\cite{zhai2023siglip}) with LLM backbones..."
  • Typography-based attacks: Jailbreaks that render harmful prompts as images of text to leverage OCR. "Typography-based attacks exploit VLMs' OCR capabilities: FigStep~\cite{gong2024figstep} converts harmful instructions into typographic images,"
  • Universal adversarial image: A single image crafted to induce harmful behavior across varied prompts or models. "\citealt{qi2024visual} demonstrated that a single universal adversarial image can jailbreak multiple harmful instructions."
  • Visual Analogy Riddle: An attack that encodes prohibited concepts via benign analogies that must be solved. "Visual Analogy Riddle: Visual analogies whose solution requires inferring a prohibited concept through implicit reasoning."
  • Visual Cipher: An attack that encodes harmful text as glyph sequences with a legend for decoding. "Visual Cipher: Harmful instructions encoded as abstract glyph sequences with a visual decoding legend."
  • Visual embeddings: Vector representations of images produced by the vision encoder for fusion with language. "safety training on text does not automatically transfer to visual embeddings."
  • Visual Object Replacement: An attack that swaps a harmful object with a benign-looking one while context implies the original. "Visual Object Replacement: Harmful objects replaced with benign substitutes (e.g., bomb \rightarrow banana) while scene context preserves the original implication."
  • Visual Text Replacement: An attack that replaces harmful text in images with benign words while context suggests the original meaning. "Visual Text Replacement: Harmful text in images (e.g., book covers) replaced with benign words while visual and cultural context implies the original referent."
  • Vision encoders: Models that transform images into embeddings usable by language backbones. "Modern VLMs combine pretrained vision encoders (e.g., CLIP~\cite{radford2021learning}, SigLIP~\cite{zhai2023siglip}) with LLM backbones..."

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 23 tweets with 128 likes about this paper.