Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
88 tokens/sec
GPT-4o
11 tokens/sec
Gemini 2.5 Pro Pro
52 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
10 tokens/sec
DeepSeek R1 via Azure Pro
33 tokens/sec
Gemini 2.5 Flash Deprecated
12 tokens/sec
2000 character limit reached

ViTC: Vision-in-Text Challenge Benchmark

Updated 12 July 2025
  • Vision-in-Text Challenge (ViTC) is a benchmark assessing LLMs' capacity to decode visually encoded text patterns like ASCII art.
  • ViTC consists of two datasets, ViTC-S and ViTC-L, which use single and multi-character ASCII art to evaluate prediction accuracy and match ratio.
  • The challenge exposes LLMs' limitations in handling spatial patterns, highlighting vulnerabilities in safety protocols and guiding multimodal improvements.

The Vision-in-Text Challenge (ViTC) is a benchmark and evaluation framework introduced to systematically assess and advance the capacity of AI models—especially LLMs—to handle, interpret, and reason about information encoded visually within textual representations. Specifically, ViTC targets the model’s ability to process inputs where meaning arises not from conventional semantics but from patterns, spatial arrangements, or formatting schemes, such as ASCII art. The challenge exposes a fundamental limitation in current LLMs: their presumption that all inputs are to be interpreted semantically, which leaves them vulnerable to adversarial attacks and severely limits robustness in multimodal and safety-critical contexts (2402.11753).

1. Foundational Motivation and Definition

ViTC was established to assess whether LLMs, aligned for safety and utility via data filtering and supervised fine-tuning, can accurately interpret and respond to “vision-in-text” prompts—inputs whose meaning depends on spatial or graphical arrangement rather than linear semantics. The principal test case motivating ViTC is ASCII art, where characters or digits are depicted using a two-dimensional arrangement of symbols that forms recognizable glyphs only when spatially parsed. Since much of LLM safety alignment is built on the presumption of semantic-only input interpretation, ViTC challenges this assumption and queries: Can LLMs recognize visually encoded prompts where recognition and reasoning cannot be achieved via semantics alone (2402.11753)?

2. Benchmark Composition and Task Structure

ViTC comprises two main datasets, each designed to isolate and stress the “vision-in-text” parsing capacity of LLMs:

  • ViTC-S (Single Character Set):

8,424 samples, each encoding a single digit or letter in ASCII art across 36 classes. Diversity is introduced by varying the “font” (i.e., ASCII arrangement) for each character—234 distinct styles per class—using programmatic art libraries.

  • ViTC-L (Multi-Character Sequence Set):

8,000 samples encoding sequences of 2–4 ASCII art characters, covering 800 unique classes. Labels are constructed by concatenating the decoded characters in visual order.

Each ViTC sample requires the model to output a character or string label based solely on the provided ASCII art prompt and a task description. Model accuracy is assessed using two principal metrics:

  • Prediction Accuracy (Acc):

Acc=#Correct PredictionsTotal Samples\mathrm{Acc} = \frac{\#\,\text{Correct Predictions}}{\text{Total Samples}}

  • Average Match Ratio (AMR):

AMR=1D(x,y)DM(y,y^)y\mathrm{AMR} = \frac{1}{|\mathcal{D}|} \sum_{(x,y)\in\mathcal{D}} \frac{M(y,\hat{y})}{|y|}

where M(y,y^)M(y,\hat{y}) counts matched characters between label yy and prediction y^\hat{y}. For single-character tasks, AMR=Acc\mathrm{AMR} = \mathrm{Acc} (2402.11753).

Recognition strategies tested include zero-shot, few-shot (in-context learning), and chain-of-thought prompting.

3. ArtPrompt: ASCII Art-Based Jailbreak Attack

A central motivation for ViTC is the demonstration that ASCII art can be exploited to evade LLM safety systems. The ArtPrompt attack exploits the LLM’s lack of visual parsing in two steps:

  • Word Masking:

Sensitive words (e.g., “bomb”) in an adversarial prompt are masked out.

  • Cloaked Prompt Generation:

The masked words are re-rendered as ASCII art and inserted into the original prompt.

For example, the prompt “How to build a bomb?” is automatically rejected by aligned LLMs. However, if “bomb” is rendered as ASCII art within the same prompt, all tested LLMs (GPT-3.5, GPT-4, Gemini, Claude, Llama2) failed to interpret the word as sensitive—thus generating unsafe content (2402.11753).

This attack is inherently practical, requiring only black-box access, and leverages widespread limitations in LLM visual reasoning.

4. Empirical Evaluation and Model Vulnerabilities

Comprehensive evaluation across state-of-the-art LLMs with ViTC revealed:

  • On ViTC-S, the best-performing model (GPT-4) achieved only 25.19% accuracy in zero-shot settings, drastically lower than text or code completion tasks.
  • On ViTC-L, all models performed near random guess levels (GPT-4: 3.26% accuracy; AMR similarly low).
  • Sophisticated prompting strategies (few-shot, chain-of-thought) led to only marginal improvements.

These findings establish that the tested LLMs do not extract visual meaning from ASCII arrangement and instead default to semantic heuristics. Prompted to “recognize the letter,” LLMs nearly always fail, unable to bridge the pattern-recognition gap (2402.11753).

5. Security and Safety Implications

The demonstrated vulnerability has substantial ramifications:

  • Alignment Vulnerability:

The core assumption—inputs can only be semantically interpreted—does not hold against ASCII art prompts. Safety mechanisms are easily subverted via pattern-encoded content.

  • Jailbreak Feasibility:

The ArtPrompt attack allows users to bypass LLM safety alignment simply by substituting forbidden tokens with their ASCII-art analogs.

  • Broader Model Invisibility:

LLMs trained on semantics alone are blind to entire classes of pattern-encoded user intent, leaving them exposed not just in adversarial settings but also in domains requiring accurate visual-text reasoning.

6. Future Directions and Model Robustness

Efforts to address these deficiencies suggest several research avenues:

  • Multimodal Interpretation Mechanisms:

Integrate visual parsing modules into LLMs to “read” and parse spatial or graphical encodings within text—moving toward “vision–in–text” models that extend beyond pure semantics.

  • Advanced Defenses:

Simple perplexity-based or paraphrasing defenses are inadequate for ASCII-based attacks. Research is needed into defenses that detect and neutralize pattern-based adversarial encodings, potentially leveraging hybrid vision–language architectures.

  • Training Data and Corpus Curation:

Emphasize training strategies that incorporate a spectrum of pattern-based, multimodal, or adversarial encodings so models learn visual as well as semantic representations.

  • Evaluation Across Modalities:

Extend ViTC-style benchmarks to multimodal models, as the vulnerability likely persists for vision-capable models when the visual cue remains embedded in text rather than as an image (2402.11753).

7. ViTC in the Broader Context of Vision-Language Research

Within the landscape of vision-language integration, ViTC establishes a critical, previously overlooked evaluation axis. The challenge highlights the blind spots in contemporary LLM alignment protocols and benchmark methodologies—namely, the assumption of strictly semantic interpretation. ViTC thus catalyzes research in multimodal robustness, adversarial safety, and the development of models capable of “reading” text with the flexibility of a visually attentive human reader.


The Vision-in-Text Challenge thus provides a rigorous and crucial assessment of LLMs’ capacity for pattern-based visual reasoning within text. By exposing vulnerabilities and charting the requirements for robust multimodal understanding, ViTC frames a new frontier in the safety and reliability evaluation for next-generation AI models (2402.11753).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)