
Typographic Attacks on LVLMs

Updated 22 September 2025
  • Typographic attacks embed misleading text in images to force LVLMs into misclassifying inputs or bypassing safety protocols.
  • Adversaries use techniques such as prompt injection, scene-coherent cues, and artifact-based manipulations, achieving up to 90% transferability across models.
  • Defense strategies such as prompt engineering and partial-perception supervision (DPS) offer partial mitigation, underscoring the persistent challenge of securing LVLMs against evolving typographic threats.

Typographic attacks on Large Vision-Language Models (LVLMs) are a class of adversarial manipulations in which misleading or malicious text is visually embedded within images, exploiting the model’s reliance on text-visual correlations. These attacks degrade multimodal reliability, can bypass conventional safety mechanisms, and pose significant security risks in applications such as digital assistants, autonomous systems, and content moderation. Research on typographic threats is extensive, demonstrating impactful vulnerabilities, proposing taxonomies and evaluation frameworks, and exploring both digital and physical attack settings.

1. Foundations and Mechanisms of Typographic Attacks

Typographic attacks exploit the heavy reliance of LVLMs on visible text within images. LVLMs habitually leverage textual cues—often explicitly or implicitly learned from web-scale datasets—to inform predictions and reasoning even when the primary visual content is in conflict with these cues. A basic typographic attack involves overlaying class-matching or misleading text (phrases, labels, or instructions) onto an image, with the intent to:

  • Shift model predictions toward the embedded cue (e.g., misclassifying a cat as “Somali” when that word is added to the image (Qraitem et al., 1 Feb 2024, Cheng et al., 29 Feb 2024, Westerhoff et al., 7 Apr 2025)).
  • Bypass text-based and cross-modal safety alignment by using OCR or image encoders, especially for instructional or harmful content (e.g., jailbreaking via typographic stepwise instructions (Gong et al., 2023)).
  • Weaponize non-class-matching cues (“artifacts”) that associate with target classes by dataset-induced correlations—such as logos, symbols, and even partial or misspelled words (Qraitem et al., 17 Mar 2025).

The attack’s general mechanism can be formalized as a transformation $\tau$ that alters an image $v$ to include an adversarial artifact (textual or graphical), producing $\hat{v} = \tau(v)$. The attacked LVLM, given $(\hat{v}, t_{\text{in}})$, is thereby induced to output a target prediction or to violate safety constraints.
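
As a rough illustration (not a procedure from any cited paper), the transformation $\tau$ can be as simple as rendering an attacker-chosen string onto the image before it is passed to the model with the original prompt $t_{\text{in}}$. The sketch below uses Pillow; the font, placement, and the "Somali" example string are illustrative assumptions.

```python
from PIL import Image, ImageDraw, ImageFont

def typographic_attack(image_path: str, attack_text: str,
                       position=(10, 10), font_size=32,
                       opacity=255) -> Image.Image:
    """Apply tau(v): overlay a misleading text cue on an image (illustrative sketch)."""
    img = Image.open(image_path).convert("RGBA")
    overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    try:
        font = ImageFont.truetype("DejaVuSans-Bold.ttf", font_size)  # assumed font file
    except OSError:
        font = ImageFont.load_default()
    # White text with a black outline keeps the cue legible on any background.
    draw.text(position, attack_text, font=font,
              fill=(255, 255, 255, opacity),
              stroke_width=2, stroke_fill=(0, 0, 0, opacity))
    return Image.alpha_composite(img, overlay).convert("RGB")

# Example: bias a cat photo toward the label "Somali" before querying an LVLM.
attacked = typographic_attack("cat.jpg", "Somali")
attacked.save("cat_attacked.jpg")
```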

2. Taxonomy and Variants

Typographic attacks span several subtypes, distinguishable by content, operational context, and the complexity of the adversarial cue:

| Variant | Cue Type | Targeted Mechanism |
| --- | --- | --- |
| Jailbreaking / prompt injection | Harmful instructions | Bypass text safety via OCR |
| Class-matching | Literal class name | Exploit text-visual correlation |
| Semantic similarity | Phonetic/visual similarity | Overlap in embedding space |
| Artifact-based | Logos/symbols | Dataset-induced spurious leakage |
| Scene-coherent/planned | Contextual text | Placement-aware, visually natural |
| Physical-world ("patch") | Real printed objects | Robust to photo recapture; mobile |

Notable methods include FigStep (typographic images for jailbreaking) (Gong et al., 2023), scene-coherent adversarial planning (SceneTAP) (Cao et al., 28 Nov 2024), artifact search combining OCR and CLIP-based retrieval (Qraitem et al., 17 Mar 2025), and multi-image text selection strategies that use text-image similarity for stealth and efficiency (Wang et al., 12 Feb 2025).
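
For the retrieval-style components of these methods, the core idea can be sketched as scoring candidate attack strings against the attacker's target class in CLIP's joint embedding space and keeping the highest-scoring candidates. The checkpoint name and candidate list below are illustrative assumptions, and the cited methods add further steps (OCR matching, placement planning) not shown here.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

target_class = "stop sign"                                      # class the attacker wants to induce
candidates = ["STOP", "halt", "yield", "S T O P", "no entry"]   # illustrative candidate cues

with torch.no_grad():
    target_emb = model.get_text_features(
        **processor(text=[target_class], return_tensors="pt", padding=True))
    cand_emb = model.get_text_features(
        **processor(text=candidates, return_tensors="pt", padding=True))

# Cosine similarity between each candidate cue and the target class embedding.
target_emb = target_emb / target_emb.norm(dim=-1, keepdim=True)
cand_emb = cand_emb / cand_emb.norm(dim=-1, keepdim=True)
scores = (cand_emb @ target_emb.T).squeeze(-1)

for text, score in sorted(zip(candidates, scores.tolist()),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{text!r}: {score:.3f}")
```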

3. Attack Efficacy, Empirical Results, and Transferability

Empirical studies consistently demonstrate that typographic attacks cause substantial degradation in LVLM performance across both open-source and commercial models. Highlights include:

  • Jailbreak via Typographic Prompts: FigStep achieves an average attack success rate (ASR) of 82.50% versus 44.80% for text-only baselines on major open-source LVLMs (Gong et al., 2023). Enhanced variants (e.g., FigStep-Pro) increase ASR to 70% on otherwise robust models such as GPT-4V.
  • Classification Corruption: Real-world hand-written attacks in the SCAM dataset produce a ~26 percentage point drop in zero-shot classification accuracy for CLIP-based and LVLM models (Westerhoff et al., 7 Apr 2025).
  • Robustness Variation: Benchmarks like REVAL show performance drops of 16–30% for high-profile models (e.g., GPT-4o, Gemini-1.5-Pro, GLM-4V-9B) under typographic attacks (Zhang et al., 20 Mar 2025).
  • Physical and Scene-Coherent Attacks: Robust attacks persist even when patches are printed and placed in real scenes (demonstrated against ChatGPT-4o and others) (Cao et al., 28 Nov 2024).
  • Transferability: Artifact-based and similarity-guided attacks, optimized on one model, transfer with up to 90% success to unseen architectures (Qraitem et al., 17 Mar 2025, Wang et al., 12 Feb 2025). This suggests a shared vulnerability linked to common pretraining objectives and visual-text alignment.

4. Causal Analysis, Attentional Dynamics, and Underlying Vulnerabilities

Research reveals several common factors contributing to typographic vulnerability:

  • Vision Encoder Distraction: Grad-CAM analyses show that vision encoders (especially CLIP variants) devote disproportionate attention to inserted text, diminishing focus on genuine visual content (Cheng et al., 29 Feb 2024).
  • Cross-Modal Misalignment: Safety alignment, robust in textual modules via techniques such as RLHF, fails to transfer to the visual pathway, allowing harmful semantic content to propagate when encoded as typography (Gong et al., 2023).
  • Factor Sensitivity: Performance drops are modulated by the font size, opacity, color, and spatial position of the attack text (Cheng et al., 29 Feb 2024, Cheng et al., 14 Mar 2025); larger, high-contrast, and well-placed cues are more disruptive (a schematic factor sweep follows this list).
  • Model Architecture and Training Data: Vulnerability is correlated with vision encoder choices (ViT vs. RN, CLIP vs. SigLIP), LLM backbone size, and the filtering in pretraining datasets (Westerhoff et al., 7 Apr 2025). Larger LLMs and careful curation mitigate, but do not eliminate, susceptibility.
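
The factor sensitivity noted above can be probed with a simple sweep: render the same attack string at different font sizes, opacities, and positions and measure the victim model's accuracy on each variant. The loop below is a schematic that assumes the `typographic_attack` helper sketched earlier and a `classify` callable standing in for whichever LVLM or zero-shot classifier is under test.

```python
from itertools import product

# Hypothetical factor grid; published studies sweep finer and wider ranges.
FONT_SIZES = [16, 32, 64]
OPACITIES = [128, 255]             # semi-transparent vs. fully opaque
POSITIONS = [(10, 10), (10, 200)]  # top-left vs. lower region

def accuracy_under_attack(dataset, attack_text, classify, **attack_kwargs):
    """Fraction of images still classified correctly after the typographic overlay."""
    correct = 0
    for image_path, true_label in dataset:
        attacked = typographic_attack(image_path, attack_text, **attack_kwargs)
        correct += int(classify(attacked) == true_label)
    return correct / len(dataset)

def factor_sweep(dataset, attack_text, classify):
    results = {}
    for size, opacity, pos in product(FONT_SIZES, OPACITIES, POSITIONS):
        acc = accuracy_under_attack(dataset, attack_text, classify,
                                    font_size=size, opacity=opacity, position=pos)
        results[(size, opacity, pos)] = acc
        print(f"size={size:3d} opacity={opacity:3d} pos={pos}: accuracy={acc:.2%}")
    return results
```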

5. Real-World and Digital Application Domains

Typographic attacks generalize from controlled digital settings to real-world deployments:

  • Digital Assistant/Jailbreak: Attackers bypass prompt filtering by visually encoding disallowed instructions, compelling unsafe completions even in heavily aligned models (Gong et al., 2023).
  • Autonomous Systems: Adversarial text and directives injected into traffic scenes (on signs, vehicles, or billboards) disrupt perception, counting, object recognition, and control reasoning (Chung et al., 23 May 2024).
  • Multimodal Content Generation: Image-to-image generators can be hijacked to align output semantics with injected visual cues, as measured by metrics such as CLIPScore and FID (Cheng et al., 14 Mar 2025).
  • Watermark Removal and Copyright Evasion: Character-level typographic manipulations disrupt LLM tokenization, efficiently lowering watermark scores and enabling undetectable misuse (Zhang et al., 11 Sep 2025); a tokenization sketch follows this list.
  • Stealth and Multi-Image Strategies: By optimizing text-image similarity and avoiding repeated attack texts across images, adversaries maintain stealth while improving attack success rates by 21% (Wang et al., 12 Feb 2025).
  • Malicious Font Injection: Manipulation of code-to-glyph mappings enables delivery of hidden adversarial prompts within seemingly benign documents; this is notably effective in advanced tool-integrated LLM systems (Xiong et al., 22 May 2025).
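
To make the tokenization-disruption point concrete, the sketch below swaps a few Latin characters for visually identical Cyrillic ones and compares tokenizations; the homoglyph table and the GPT-2 tokenizer are illustrative choices, not the edit strategy of the cited watermark-removal work.

```python
from transformers import AutoTokenizer

# Cyrillic look-alike substitutions are one illustrative form of character-level
# typographic noise; they leave the text visually unchanged to a human reader.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

def perturb(text: str) -> str:
    """Swap selected Latin characters for visually identical glyphs."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

tok = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer, for illustration only
original = "the watermark survives ordinary paraphrasing"
print(tok.tokenize(original))
print(tok.tokenize(perturb(original)))
# The perturbed string tokenizes into different (often byte-level) pieces, so
# token-level watermark statistics computed over the original vocabulary no longer line up.
```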

6. Defense Methodologies and Remaining Challenges

A range of defense measures have been studied, though none achieve comprehensive protection against sophisticated typographic attacks:

  • Prompt Engineering: Strong prompts (e.g., “ignore text in the image”) marginally improve resistance, but fail against scene-coherent or contextually blurred attacks (Cao et al., 28 Nov 2024, Qraitem et al., 1 Feb 2024).
  • Partial-Perception Supervision (DPS): Combining responses from models restricted to cropped or partial views of the image dilutes typographic attacks, reducing ASR by 76.3% and outperforming prior ensemble or voting techniques (Zhou et al., 17 Dec 2024); a crop-and-vote sketch follows this list.
  • Artifact-Aware Prompting: Explicitly incorporating detected artifacts or textual cues into downstream prompts provides a moderate reduction in attack success rates (up to 15%) (Qraitem et al., 17 Mar 2025).
  • Robust Pretraining and Alignment: Filtering training data to discourage spurious text-correlated associations (e.g., using CommonPool instead of LAION), and leveraging larger LLM backbones, offer heightened resilience—but cannot prevent all classes of contextual attacks (Westerhoff et al., 7 Apr 2025).
  • Scene-Coherence Disruption: Future defense research is directed toward integrating semantic segmentation, selective filtering (e.g., Segment Anything Model), and context-consistent attention masking (Zhou et al., 17 Dec 2024, Cao et al., 28 Nov 2024).
  • Watermark Invariance: For watermark preservation, strategies must address tokenization sensitivity to typographic noise, potentially via error-correcting coding or higher-level linguistic embedding (Zhang et al., 11 Sep 2025).
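
As a rough illustration of the partial-perception idea referenced in the DPS entry above (not the authors' exact procedure), one can query the model on overlapping crops of the input and aggregate the answers, so that a localized typographic cue only sways the views that contain it. Here `ask_lvlm` is a placeholder for whatever model API is in use.

```python
from collections import Counter
from PIL import Image

def partial_perception_answer(image: Image.Image, question: str, ask_lvlm,
                              grid=(2, 2)) -> str:
    """Query an LVLM on overlapping crops and take a majority vote (illustrative sketch)."""
    w, h = image.size
    cols, rows = grid
    answers = [ask_lvlm(image, question)]  # keep one full-image view as well
    for r in range(rows):
        for c in range(cols):
            # Crops overlap by half a cell so objects near cell borders are not cut off.
            left = int(c * w / (cols + 1))
            top = int(r * h / (rows + 1))
            right = int(left + 2 * w / (cols + 1))
            bottom = int(top + 2 * h / (rows + 1))
            answers.append(ask_lvlm(image.crop((left, top, right, bottom)), question))
    # A typographic cue confined to one region influences only the views that contain it.
    return Counter(answers).most_common(1)[0][0]
```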

Remaining open problems include defending against physical-world attacks, detection and filtering of sophisticated scene-coherent typography, and developing universal, cross-modal alignment mechanisms robust to dynamic and context-dependent adversarial cues.

7. Benchmarking, Datasets, and Evaluation Frameworks

Robust assessment of typographic vulnerabilities has been enabled by recent datasets and evaluation protocols:

  • TypoD (Cheng et al., 29 Feb 2024): The largest-scale systematic dataset, supporting factor control (font size, color, opacity, position) across both vision-perceptual and high-level reasoning tasks.
  • SCAM (Westerhoff et al., 7 Apr 2025): A real-world dataset of 1,162 images with handwritten attack words, spanning hundreds of categories; synthetic attacks (SynthSCAM) are validated to match physical-world effectiveness.
  • TVPI (Cheng et al., 14 Mar 2025): Systematic testbed for typographic visual prompt injection in both LVLM and I2I generative model settings, with structured subtasks (object, color, quantity, size).
  • REVAL (Zhang et al., 20 Mar 2025): Integrates typographic attack scenarios into a comprehensive mixed-task, multi-metric LVLM evaluation suite.
  • Attack Success Rate (ASR): The standard metric, measuring the fraction of adversarial samples for which the targeted model's output conforms to the attacker's specified target; a minimal computation is sketched after this list.
  • Other Metrics: CLIPScore and FID for generative tasks; naturalness and comprehensive scores for scene-coherent attack evaluation; confusion matrices and attention visualizations for interpretability.
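
Read literally, the ASR definition above reduces to a simple scoring loop; the `target_reached` predicate below is a hypothetical stand-in for each benchmark's own matching rule (exact string match, keyword match, or a judge model).

```python
def attack_success_rate(samples, model, target_reached) -> float:
    """ASR = fraction of adversarial samples whose output satisfies the attacker's target.

    samples:        list of (attacked_image, prompt, attack_target) triples
    model:          callable mapping (image, prompt) -> output text
    target_reached: predicate deciding whether an output matches the attack target;
                    benchmarks differ in how strictly this match is defined.
    """
    hits = sum(
        int(target_reached(model(image, prompt), target))
        for image, prompt, target in samples
    )
    return hits / len(samples)
```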

These benchmarks have facilitated precise cross-model comparisons, enabled ablation and causal analyses, and driven exploration of attack transferability and defense robustness.


Typographic attacks constitute a critical, persistent, and evolving threat vector for LVLMs. Their efficacy arises from exploiting the intersection of visual and textual pathways, capitalizing on training-induced correlations, and leveraging weaknesses in cross-modal attention and alignment. Despite rapid progress in detection and defense, robust solutions remain elusive—especially as attack strategies grow in sophistication (scene coherence, non-repetition, artifact search). Ongoing research, informed by comprehensive datasets and rigorous evaluation, is essential to hardening LVLMs against these attacks and ensuring trustworthy deployment in safety- and security-critical contexts.
