
Vision-Text Compression (VTC) Overview

Updated 24 December 2025
  • Vision-Text Compression (VTC) is a method that converts text and vision inputs into dense visual tokens to reduce memory and computational load.
  • VTC techniques, including vision-only pruning, text-guided selection, and token distillation, achieve compression ratios from 3× to 576×.
  • Empirical results reveal that advanced VTC architectures maintain high OCR accuracy and enable significant inference speedups in multimodal models.

Vision-Text Compression (VTC), also known as optical context compression, is an emerging paradigm for reducing the computational and memory costs of multimodal LLMs (MLLMs) and vision-language models (VLMs). By rendering text or vision-language inputs into dense visual representations, VTC exploits the high information density of images to achieve token compression ratios between 3× and 576×, trading some fidelity for large efficiency gains. This article provides a comprehensive overview of VTC: its core methodologies, representative algorithms, benchmarking challenges, empirical findings, and open research problems, as reported in the current literature.

1. Problem Definition and Motivation

VTC addresses the fundamental scaling bottleneck of MLLMs and VLMs: the quadratic complexity of self-attention with respect to input sequence length. As LLMs and VLMs extend their context windows to hundreds of thousands or even millions of tokens, memory and FLOPs grow prohibitively. Under VTC, text sequences are first rendered as one or more compact 2D images using configurable typography and layout; these are then processed by vision encoders to yield a significantly reduced number of visual tokens. The VTC compression ratio is defined as

R = \frac{N_T}{N_I}

where N_T is the number of original (text or patch) tokens and N_I is the number of resulting visual tokens. Representative VTC ratios in the literature include 10× for DeepSeek-OCR (text-to-image, OCR-style decoding), up to 576× for VoCo-LLaMA’s token distillation, and 3–8× in multi-page settings such as Glyph and VTCBench (Wei et al., 21 Oct 2025, Cheng et al., 20 Oct 2025, Ye et al., 18 Jun 2024, Zhao et al., 17 Dec 2025, Yang et al., 11 Jun 2024, Xing et al., 2 Feb 2025, Chen et al., 2 Sep 2024, Li et al., 1 Apr 2025, Wang et al., 8 Dec 2024, Zhu et al., 21 Nov 2024, Lu et al., 27 Mar 2025).

The canonical workflow is:

  1. Render the full or partial context to an image with a specified rendering configuration (e.g., font, dpi, line spacing).
  2. Encode the resulting image via a vision backbone, projecting into a small number of visual tokens using token compression modules.
  3. Feed these visual tokens (possibly mixed with retained text tokens) into the LLM or VLM for downstream inference or generation.
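
A minimal sketch of steps 1–2 follows, assuming Pillow for rendering, a ViT-style encoder that produces one token per 16×16 patch, a generic token-merging factor standing in for a compression module, and whitespace splitting as a crude proxy for an LLM tokenizer. All parameters here are illustrative rather than taken from any of the cited systems.

```python
from PIL import Image, ImageDraw, ImageFont  # pip install pillow
import textwrap

def render_text_to_image(text, width=1024, height=1024,
                         font_size=14, margin=16, line_spacing=4):
    """Step 1: render the context onto a white page with fixed typography."""
    page = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.load_default()          # stand-in for a configurable TTF font
    chars_per_line = (width - 2 * margin) // max(font_size // 2, 1)
    y = margin
    for line in textwrap.wrap(text, width=chars_per_line):
        draw.text((margin, y), line, fill="black", font=font)
        y += font_size + line_spacing
        if y > height - margin:
            break                            # real systems paginate into multiple images
    return page

def visual_token_count(img, patch_size=16, merge_factor=16):
    """Step 2 (approximation): the encoder yields one token per patch, then a
    token-compression module (e.g., pooling/conv projector) merges patches by
    merge_factor. Both numbers are illustrative."""
    w, h = img.size
    raw_patches = (w // patch_size) * (h // patch_size)
    return max(raw_patches // merge_factor, 1)

context = "a long document to be compressed ... " * 200   # placeholder text
n_text_tokens = len(context.split())        # crude proxy for an LLM tokenizer
page = render_text_to_image(context)
n_visual_tokens = visual_token_count(page)  # 4096 patches merged to 256 tokens
print(f"R = N_T / N_I = {n_text_tokens / n_visual_tokens:.2f}")
```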

This approach is motivated by the need to scale context length without incurring prohibitive resource use, and by the observation that much of the input, especially in vision or long text, is redundant or irrelevant for downstream tasks (Wei et al., 21 Oct 2025, Cheng et al., 20 Oct 2025, Xing et al., 2 Feb 2025).

2. Compression Algorithms and Architectures

The literature on VTC encompasses a spectrum of algorithmic strategies, including pure visual token pruning, hybrid vision-text selection, and self-distilled compression schemes. Notable categories include:

  • Vision-only Pruning: Utilizes pretrained vision encoder attention to estimate token importance, dropping low-importance patches. Methods such as VTC-CLS use [CLS] token attention across layers to select tokens, achieving up to 1.6× speedup without retraining (Wang et al., 8 Dec 2024).
  • Question-/Text-guided Pruning: Integrates textual input (e.g., task instruction, question) to score visual tokens by relevance, either during compression (QG-VTC) or by dynamic recovery (Recoverable Compression) (Li et al., 1 Apr 2025, Chen et al., 2 Sep 2024). Correlation scores between text and visual token embeddings are computed, and only the most informative tokens are retained (a toy sketch of this and the [CLS]-attention scoring rule follows this list).
  • Coarse-to-fine Sampling: FocusLLaVA introduces a two-stage system combining a vision-guided multi-scale downselection and a text-guided sampler within the LLM, reducing visual tokens to 39% of baseline with simultaneous accuracy improvements (Zhu et al., 21 Nov 2024).
  • Latent Compression Learning: LCL unifies image and text tokens in a causal transformer, learning compressed representations that maximize mutual information between inputs and outputs; contrastive and generative objectives are balanced (Yang et al., 11 Jun 2024).
  • Context Rendering and Optical Compression: DeepSeek-OCR and Glyph render context as images and decode via OCR or tokenizing VLMs, attaining 3–10× compression while sustaining >96% OCR accuracy for r<10 (Wei et al., 21 Oct 2025, Cheng et al., 20 Oct 2025, Zhao et al., 17 Dec 2025).
  • Token Distillation: VoCo-LLaMA introduces learnable “VoCo” tokens distilled directly inside the LLM via modified attention masking and KL-divergence to the teacher (Ye et al., 18 Jun 2024).
  • Hybrid Compression Modules: InternVL-X deploys multiple modules (PVTC, LVTC, RVTC) for local-global aggregation, layer-wise expansion, and resolution-adaptive token slicing, enabling further acceleration with minimal loss (Lu et al., 27 Mar 2025).
  • Token Recovery and Soft Merging: Techniques dynamically recover tokens relevant to the text query using local outlier factor (LOF) analysis and merge residual tokens by clustering, retaining as few as 10% of the baseline visual tokens (Chen et al., 2 Sep 2024).
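
As referenced in the pruning bullets above, the toy sketch below shows a [CLS]-attention score and a text-guided cosine-similarity score over visual tokens, followed by hard top-k selection. Random tensors stand in for real encoder outputs, and the equal weighting of the two scores is arbitrary; this illustrates the general idea rather than the exact VTC-CLS or QG-VTC procedures.

```python
import torch
import torch.nn.functional as F

def cls_attention_scores(attn):
    """Vision-only signal: attention paid by the [CLS] token (index 0) to each
    patch token, averaged over heads. attn: [heads, 1 + N, 1 + N] from one ViT layer."""
    return attn[:, 0, 1:].mean(dim=0)                      # [N]

def text_guided_scores(patch_emb, question_emb):
    """Text-guided signal: cosine similarity between each visual token embedding
    and a pooled question embedding."""
    return F.cosine_similarity(patch_emb, question_emb.unsqueeze(0), dim=-1)  # [N]

def keep_top_k(tokens, scores, k):
    """Hard pruning: retain the k highest-scoring visual tokens, in original order."""
    idx = scores.topk(k).indices.sort().values
    return tokens[idx]

# Toy usage: 576 patch tokens (a 24x24 grid), keep 12.5% of them.
heads, n_patches, dim = 12, 576, 1024
attn = torch.rand(heads, 1 + n_patches, 1 + n_patches).softmax(dim=-1)
patch_tokens = torch.randn(n_patches, dim)
question_emb = torch.randn(dim)

scores = 0.5 * cls_attention_scores(attn) + 0.5 * text_guided_scores(patch_tokens, question_emb)
kept = keep_top_k(patch_tokens, scores, k=n_patches // 8)
print(kept.shape)    # torch.Size([72, 1024])
```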

The following table summarizes representative methods:

Method | Compression Strategy | Principle
VTC-CLS | Vision-only pruning | [CLS] attention ranking
QG-VTC | Joint text–vision guidance | Text-query-guided selection
FocusLLaVA | Coarse-to-fine sampling | Vision- and text-guided sampling
DeepSeek-OCR | Text→image rendering | End-to-end OCR decoding
VoCo-LLaMA | Token distillation | LLM-learned VoCo tokens
InternVL-X | Multi-module compression | Projector, layer-wise, and slice modules (PVTC/LVTC/RVTC)
Recoverable Compression | Token recovery and merging | LOF scoring + cluster merging

3. Evaluation Methodologies and Benchmarks

Evaluation of VTC algorithms presents significant challenges:

  • Benchmark Mismatch: Standard MLLM tasks (e.g., GQA, MMBench, MMStar) were designed for perception and reasoning, not compression. They contain a preponderance of “easy” queries solvable after severe downsampling, leading to low discriminative power for compression studies (Liao et al., 8 Oct 2025).
  • VTC-Bench Framework: To address this, VTC-Bench introduces a data filtering mechanism where image downsampling itself is used to partition samples into “easy” (solved by downsampling) and “difficult” (only solvable with retention of fine details); evaluation statistics are reported only on the difficult subset (Liao et al., 8 Oct 2025).
  • Compression Metrics: Common metrics include the accuracy drop (ΔAcc), the compression ratio (r = 1 − C), and the speedup (S = T_orig / T_method), all reported on the difficult subset (a minimal metrics sketch follows this list).
  • Specialized Benchmarks: VTCBench, VTCBench-Wild, and composite tasks (retrieval, reasoning, dialogue/memory) probe the true long-context and associative capabilities of VLMs under VTC. Results suggest that VLMs perform well on OCR-style and retrieval metrics at short contexts (on the order of 1k tokens) but degrade rapidly on associative, multi-hop, or “middle-distant” tasks, unless specifically trained for 2D context handling (Zhao et al., 17 Dec 2025).
  • Resilience to Render Variations: Diversity in font size, rendering style, layout, and text location can have pronounced effects; style-robustness is an open requirement.
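
As referenced in the metrics bullet, the sketch below computes the accuracy drop, compression ratio, and speedup on a VTC-Bench-style “difficult” subset, where difficulty is defined by whether heavy downsampling alone already yields a correct answer. The Sample fields, the interpretation of C as the fraction of retained tokens, and the toy numbers are assumptions for illustration, not the benchmark's actual code.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    correct_full_res: bool     # answer correct on the original, uncompressed input
    correct_downsampled: bool  # answer correct after heavy image downsampling
    correct_compressed: bool   # answer correct under the VTC method being evaluated

def difficult_subset(samples):
    """VTC-Bench-style filter: keep only samples that downsampling alone cannot solve."""
    return [s for s in samples if not s.correct_downsampled]

def accuracy(samples, field):
    return sum(getattr(s, field) for s in samples) / max(len(samples), 1)

def report(samples, n_text_tokens, n_visual_tokens, t_orig, t_method):
    hard = difficult_subset(samples)
    delta_acc = accuracy(hard, "correct_full_res") - accuracy(hard, "correct_compressed")
    kept_fraction = n_visual_tokens / n_text_tokens   # assumption: C = fraction of tokens retained
    return {
        "difficult_samples": len(hard),
        "delta_acc": delta_acc,                       # accuracy drop on the difficult subset
        "r": 1 - kept_fraction,                       # Section 3 form, r = 1 - C
        "R": n_text_tokens / n_visual_tokens,         # Section 1 form, R = N_T / N_I
        "speedup": t_orig / t_method,                 # S = T_orig / T_method
    }

# Toy usage with made-up outcomes and timings.
data = [Sample(True, True, True), Sample(True, False, True),
        Sample(True, False, False), Sample(False, False, False)]
print(report(data, n_text_tokens=10_000, n_visual_tokens=1_000, t_orig=2.4, t_method=0.8))
```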

4. Empirical Performance and Trade-offs

Key empirical findings across recent VTC work include:

  • Relative Efficiency: At compression ratios up to roughly 10×, vision-text compression can achieve 96–98% reconstruction accuracy on English and Chinese OmniDocBench while using less than one-sixth of the vision tokens employed by prior baselines (Wei et al., 21 Oct 2025).
  • Downsampling as a Baseline: Against expectation, trivial image downsampling outperforms complex pruning/merging methods on raw benchmarks, essentially because such benchmarks are saturated with easy cases (Liao et al., 8 Oct 2025).
  • Hierarchical and Query-Guided Compression: Hierarchical encoders and question-guided pruning consistently retain higher accuracy under severe compression ratios (outperforming random, [CLS]-only, or one-shot hard pruning), with QG-VTC retaining >94% of baseline with just 12.5% of tokens on six VQA benchmarks (Li et al., 1 Apr 2025).
  • System-wide Gains: Methods such as InternVL-X’s composite pipeline can reduce vision tokens to 20–25%, cutting FLOPs by 50–95% and speeding up inference by 69% with negligible loss or even performance gains on certain benchmarks (Lu et al., 27 Mar 2025, Ye et al., 18 Jun 2024).
  • Fundamental Limitations: Optical autoencoding approaches (e.g., DeepSeek-OCR) are nearly optimal on reconstruction but can fall short on true language modeling or association benchmarks. Hierarchical text-based compression baselines often yield better perplexity under equivalent ratios (Lee et al., 3 Dec 2025).
  • Upper Bound for "Lossless" Compression: Empirically, r ≈ 10 marks the effective upper limit for near-lossless OCR-style compression; beyond this, layout, legibility, and spatial ambiguity degrade performance substantially (Wei et al., 21 Oct 2025).

5. Limitations, Failure Modes, and Open Problems

Several non-trivial technical limitations and open problems are recurrently identified:

  • Semantic and Associative Reasoning: Most VLMs fail on associative reasoning and memory tasks under VTC, even when OCR and retrieval metrics remain high. The “lost in the middle” problem—where mid-context information in the rendered image is inaccessible—mirrors issues in 1D LLMs, but with additional 2D spatial pathologies (Zhao et al., 17 Dec 2025).
  • Overfitting to Shortcut Cues: Standard benchmarks often lead sophisticated token selection methods to overfit toward preserving globally salient tokens, ignoring task-specific or instruction-specific requirements (Liao et al., 8 Oct 2025, Wang et al., 8 Dec 2024).
  • Hyperparameter Sensitivity: Methods relying on LOF, clustering, or fixed pruning schedules are sensitive to dataset-specific parameters; brittle or sub-optimal settings can erase the gains of VTC (Chen et al., 2 Sep 2024, Li et al., 1 Apr 2025).
  • Robustness to Render Variability: Changes in font size, rendering parameters, and tile order can dramatically decrease VLM performance, pointing to lack of style-agnostic generalization (Zhao et al., 17 Dec 2025).
  • Resolution-Semantic Gap: Extreme compression typically demands small font sizes or low image resolutions, potentially rendering text illegible to the vision backbone (Wei et al., 21 Oct 2025, Xing et al., 2 Feb 2025).

6. Prospective Solutions and Research Directions

Current research proposes several technical avenues to address the observed shortcomings of existing VTC systems:

  • Benchmarks and Supervision: Developing new VTC benchmarks focused on difficult, high-associativity, and reasoning-intensive samples is critical. Pre-training on both OCR and context reasoning over rendered pages is recommended (Liao et al., 8 Oct 2025, Zhao et al., 17 Dec 2025).
  • Hybrid and Adaptive Models: Combining VTC with lightweight textual summarization, chunking, or adaptive layout selection offers a plausible path to balance compression and legibility. Dynamic adjustment of compression based on query or content complexity is an open challenge (Zhao et al., 17 Dec 2025, Li et al., 1 Apr 2025).
  • Spatial Positioning and Graph Attention: Implementing 2D positional encodings and attention patterns to capture grid structure, rather than only sequential dependencies, is necessary for associative and memory tasks (Zhao et al., 17 Dec 2025); a generic 2D encoding sketch follows this list.
  • Hierarchical/Fine-Grained Compression: Multi-level, query-aware token compression that dynamically selects critical regions or selectively upsamples complicated contexts can mitigate information loss without overburdening resource use (Lu et al., 27 Mar 2025, Zhu et al., 21 Nov 2024, Li et al., 1 Apr 2025).
  • Learned Modality Crosswalks: There is active exploration into joint digital-optical pretraining, loss functions that directly optimize context retrieval and reasoning, and cross-modal compression applicable to other modalities (e.g., audio, video) (Wei et al., 21 Oct 2025, Xing et al., 2 Feb 2025, Yang et al., 11 Jun 2024).
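
In the spirit of the spatial-positioning direction above, the sketch below builds a generic 2D sinusoidal positional encoding for an H×W grid of visual tokens: half of the channels encode the row index and half the column index. This is a standard construction used for illustration; it is not the encoding proposed in the cited work.

```python
import torch

def sinusoid(positions, dim):
    """Standard 1D sinusoidal features for integer positions. Returns [N, dim]."""
    freqs = torch.exp(-torch.arange(0, dim, 2, dtype=torch.float32)
                      * (torch.log(torch.tensor(10000.0)) / dim))
    angles = positions.float().unsqueeze(-1) * freqs         # [N, dim/2]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)   # [N, dim]

def positional_encoding_2d(height, width, dim):
    """Per-token encoding for a height x width patch grid. Returns [H*W, dim]."""
    assert dim % 4 == 0, "dim must split evenly across row/col sin/cos features"
    rows = torch.arange(height).repeat_interleave(width)      # row index per token
    cols = torch.arange(width).repeat(height)                 # column index per token
    return torch.cat([sinusoid(rows, dim // 2),
                      sinusoid(cols, dim // 2)], dim=-1)      # [H*W, dim]

# Toy usage: a 24x24 grid of patch tokens with 1024-dim embeddings.
pe = positional_encoding_2d(24, 24, 1024)
print(pe.shape)   # torch.Size([576, 1024])
```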

7. Conclusion

Vision-Text Compression represents a powerful paradigm shift in the design of scalable MLLMs and VLMs, enabling multi-modal models to process ultra-long contexts with a fraction of the compute and memory historically required. While empirical work demonstrates that aggressive compression via rendering and advanced pruning is feasible with negligible accuracy loss on standard OCR and retrieval tasks, genuine advances in associative reasoning, dialogue memory, and context retrieval await the development of benchmarks and architectures explicitly tailored to the unique challenges of high-density, 2D encoded information. The integration of adaptive, query-guided, and spatially aware compression algorithms is poised to define the next stage of research in this area (Wei et al., 21 Oct 2025, Liao et al., 8 Oct 2025, Zhao et al., 17 Dec 2025, Lu et al., 27 Mar 2025, Xing et al., 2 Feb 2025, Wang et al., 8 Dec 2024, Zhu et al., 21 Nov 2024, Li et al., 1 Apr 2025, Chen et al., 2 Sep 2024, Ye et al., 18 Jun 2024, Yang et al., 11 Jun 2024, Lee et al., 3 Dec 2025, Cheng et al., 20 Oct 2025).
