Vision Token Compression
- Vision token compression is a family of methods that reduce computational load by pruning, merging, or transforming the visual tokens produced from high-resolution images and video streams while preserving key information.
- Methods fall into static/dynamic pruning, similarity-based merging, and unified many-to-many transformations, often guided by attention- or text-based importance metrics.
- Empirical results show significant resource savings and throughput gains with minimal accuracy loss, enabling efficient deployment on diverse hardware and in real-time multimodal applications.
Vision token compression encompasses a family of techniques aimed at reducing the computational and memory overhead of handling large token sequences in vision transformers and multimodal LLMs (MLLMs), especially in scenarios where high-resolution images or long video streams are tokenized into hundreds or even thousands of embeddings. The fundamental objective is to prune, merge, or transform these tokens in a manner that minimizes information loss while achieving significant acceleration and resource savings in inference and training. Compression methods span a spectrum from static pruning to adaptive, text-/task-guided and unified many-to-many assignment frameworks, and are central to accelerating vision-LLMs for large-scale and resource-constrained deployment.
1. Core Strategies and Methodological Taxonomy
Vision token compression methods fall into three principal categories: pruning, merging, and hybrid or unified transformation frameworks (Nguyen et al., 13 Jul 2025).
- Token pruning involves scoring and dropping less informative visual tokens based on metrics such as class-token attention (e.g., EViT, DynamicViT), gradient-weighted attention (Mao et al., 30 Mar 2025), visual saliency, or text-guided relevance (Chen et al., 2 Sep 2024, Li et al., 1 Apr 2025). Pruning can use static (fixed keep-rate) or dynamic (input-dependent rate) policies; a minimal sketch follows this list.
- Token merging groups tokens with similar features or function into single tokens to exploit local or global redundancy (e.g., ToMe, PatchMerger), supporting both hard and soft assignments depending on whether input tokens are exclusively merged or distributed fractionally to outputs.
- Hybrid and unified frameworks (e.g., DiffRate (Chen et al., 2023), TokenCarve (Tan et al., 13 Mar 2025), Token Transforming (Zeng et al., 6 Jun 2025)) combine both pruning and merging, often deploying information contribution metrics, attention affinity, or differentiable proxies to optimize token compression in an adaptive and often training-free manner.
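To make the pruning category concrete, the following is a minimal PyTorch-style sketch of static token pruning scored by class-token attention; the tensor shapes, the EViT-style scoring choice, and the fixed keep rate are illustrative assumptions rather than any single published method:

```python
import torch

def prune_by_cls_attention(tokens, cls_attn, keep_rate=0.5):
    """Static pruning: keep the top-k visual tokens ranked by their
    attention to the class token (EViT-style scoring, sketched).

    tokens:   (B, N, D) visual token embeddings
    cls_attn: (B, N)    attention from the [CLS] query to each visual token
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_rate))              # fixed keep rate -> "static" policy
    keep_idx = cls_attn.topk(k, dim=1).indices  # indices of the most attended tokens
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)
    return tokens.gather(1, keep_idx)           # (B, k, D) retained tokens

# A "dynamic" policy would instead derive k (or a per-token keep mask)
# from the input itself, e.g. by thresholding cls_attn per image.
```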
The recent literature also introduces advanced concepts such as progressive encoding across frames for video (Yang et al., 12 Dec 2024), layer-wise pixel-shuffle with residual learning (Liu et al., 3 Jul 2025), and global-local scoring using dual expert modules for instruction-aware compression (Li et al., 17 May 2025, Zhu et al., 21 Nov 2024, Li et al., 1 Apr 2025).
2. Key Algorithms and Technical Implementations
Mechanisms for vision token compression rely on the integration of token importance estimation, assignment optimization, and token update or merging formulations:
- Pruning via Importance Scoring: Tokens are ranked by metrics such as attention affinity to the class token, SVD-derived information contribution scores (Tan et al., 13 Mar 2025), or gradient-weighted attention (Mao et al., 30 Mar 2025). Dynamic schemes use text-guided or instruction-aware scores for multimodal tasks (Chen et al., 2 Sep 2024, Li et al., 17 May 2025, Li et al., 1 Apr 2025).
- Merging via Similarity-Based Assignment: Tokens are combined via weighted averages or convex combinations parameterized by attention, similarity matrices, or region-wise spatial aggregation (e.g., hard merging via clustering, soft merging via normalized assignments).
- Unified Many-to-Many Transformation: The Token Transforming framework (Zeng et al., 6 Jun 2025) generalizes pruning and merging as a single linear transformation $Y = WX$, where $X \in \mathbb{R}^{N \times d}$ stacks the $N$ input tokens, $Y \in \mathbb{R}^{M \times d}$ the $M < N$ output tokens, and the transformation matrix $W \in \mathbb{R}^{M \times N}$ (learned or derived from similarities and gating) encodes soft many-to-many contributions, preserving more information than exclusive assignments; a sketch of this formulation follows the list. TokenCarve (Tan et al., 13 Mar 2025) deploys SVD-derived scores to rank tokens for pruning and to guide merging, preserving the rank of the output token matrix.
- Progressive and Layer-wise Approaches: PVC (Yang et al., 12 Dec 2024) employs a causal temporal attention mechanism so that each frame or repeated static image only incorporates novel information, while LaCo (Liu et al., 3 Jul 2025) compresses tokens inside the vision encoder using pixel-shuffle and residual channel averaging, outperforming post-encoder approaches.
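As an illustration of the many-to-many formulation $Y = WX$ above, the sketch below derives a row-stochastic assignment matrix from similarities to a reduced set of anchor tokens and applies it as one matrix multiplication; the stride-based anchor selection and the softmax temperature are illustrative assumptions, not the exact Token Transforming procedure:

```python
import torch

def soft_assignment_compress(tokens, num_out, temperature=0.1):
    """Compress N tokens to num_out tokens via Y = W X, where W is a
    row-stochastic soft assignment matrix built from token similarities
    (a generic many-to-many sketch, not a specific published method).

    tokens: (B, N, D) -> returns (B, num_out, D)
    """
    B, N, D = tokens.shape
    x = torch.nn.functional.normalize(tokens, dim=-1)
    # Pick anchor tokens with a simple stride heuristic (illustrative only);
    # learned or saliency-based anchors could be substituted here.
    anchor_idx = torch.linspace(0, N - 1, num_out).long()   # (num_out,)
    anchors = x[:, anchor_idx]                               # (B, num_out, D)
    sim = anchors @ x.transpose(1, 2)                        # (B, num_out, N) cosine similarities
    W = torch.softmax(sim / temperature, dim=-1)             # soft many-to-many contributions
    return W @ tokens                                        # Y = W X, (B, num_out, D)

# Hard pruning (one-hot rows over kept tokens) and hard merging
# (rows averaging disjoint groups) are special cases of this W.
```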
Example: Token Pruning & Squeezing (TPS) Formulation (Wei et al., 2023)
- Prune tokens based on learnable or attention-driven scores into reserved and pruned sets;
- For each pruned token $x_p$, find its "host" reserved token via cosine similarity, $h(p) = \arg\max_{r \in \mathcal{R}} \cos(x_p, x_r)$;
- Fuse information: each host reserved token $x_r$ is updated as $\hat{x}_r = \sum_{i \in \{r\} \cup \{p \,:\, h(p) = r\}} w_i\, x_i$, where the weights $w_i$ are normalized exponentials (a softmax) of the corresponding similarity scores.
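A compact PyTorch-style sketch of this prune-and-squeeze step; the importance scores are taken as given, and the softmax-of-similarity weighting (with the host's self-similarity fixed at 1) is an assumed instantiation of the normalization described above rather than the exact TPS implementation:

```python
import torch

def prune_and_squeeze(tokens, scores, keep_rate=0.5):
    """Split tokens into reserved/pruned sets by importance score, then fuse
    each pruned token into its most similar reserved "host" using
    softmax-normalized similarity weights (TPS-style, sketched).

    tokens: (B, N, D) token embeddings; scores: (B, N) importance scores
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_rate))
    order = scores.argsort(dim=1, descending=True)
    gather = lambda idx: tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
    reserved, pruned = gather(order[:, :k]), gather(order[:, k:])   # (B,k,D), (B,N-k,D)

    r = torch.nn.functional.normalize(reserved, dim=-1)
    p = torch.nn.functional.normalize(pruned, dim=-1)
    sim = p @ r.transpose(1, 2)                 # (B, N-k, k) cosine similarities
    host = sim.argmax(dim=-1)                   # host reserved token for each pruned token
    one = sim.new_tensor(1.0)                   # host's similarity to itself

    fused = reserved.clone()
    for b in range(B):                          # explicit loops kept for clarity
        for j in range(N - k):
            h = host[b, j]
            # weights: normalized exponentials of the similarity scores
            w = torch.softmax(torch.stack([sim[b, j, h], one]), dim=0)
            fused[b, h] = w[0] * pruned[b, j] + w[1] * fused[b, h]
    return fused                                # (B, k, D) squeezed reserved tokens
```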
3. Adaptive and Task-Guided Compression
Compression methods increasingly leverage task and context information:
- Text- and Question-Guided Compression: The correlation between visual tokens and question embeddings is computed; the most relevant tokens are retained, while less relevant ones are softly merged (QG-VTC (Li et al., 1 Apr 2025), FocusLLaVA (Zhu et al., 21 Nov 2024)); a minimal selection sketch follows this list. Recoverable Compression (Chen et al., 2 Sep 2024) uses external text guidance to reclaim tokens discarded by visual-only pruning.
- Instruction-aware Global-Local Selection: Methods such as FocusLLaVA (Zhu et al., 21 Nov 2024) and LLaVA-Meteor (Li et al., 17 May 2025) use instruction tokens or intermediate language representations to co-guide token selection after an initial vision-based reduction.
- Layer-wise Progressive Schemes: Token numbers are reduced at successive encoder depths, accompanied by non-parametric residuals to preserve critical information (Liu et al., 3 Jul 2025), or by upsampled expansion in later LLM layers (Lu et al., 27 Mar 2025).
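A minimal sketch of text-guided selection in this spirit; the pooled question embedding, the cosine-relevance score, and the soft merge of discarded tokens into a single context token are illustrative assumptions, not the exact QG-VTC or FocusLLaVA procedure:

```python
import torch

def question_guided_select(vis_tokens, question_emb, keep_rate=0.25):
    """Retain visual tokens most correlated with the question embedding;
    softly merge the remainder into a single context token
    (a generic text-guided sketch, not a specific published method).

    vis_tokens:   (B, N, D) projected visual tokens
    question_emb: (B, D)    pooled text/question embedding in the same space
    """
    B, N, D = vis_tokens.shape
    v = torch.nn.functional.normalize(vis_tokens, dim=-1)
    q = torch.nn.functional.normalize(question_emb, dim=-1)
    relevance = (v @ q.unsqueeze(-1)).squeeze(-1)              # (B, N) cosine relevance
    k = max(1, int(N * keep_rate))
    keep_idx = relevance.topk(k, dim=1).indices

    keep_mask = torch.zeros(B, N, dtype=torch.bool, device=vis_tokens.device)
    keep_mask.scatter_(1, keep_idx, True)
    kept = vis_tokens[keep_mask].view(B, k, D)                 # question-relevant tokens

    # Soft-merge the discarded tokens into one residual "context" token,
    # weighted by their (lower) relevance, so their information is not fully lost.
    rest_w = torch.softmax(relevance.masked_fill(keep_mask, -float("inf")), dim=1)
    context = (rest_w.unsqueeze(-1) * vis_tokens).sum(dim=1, keepdim=True)  # (B, 1, D)
    return torch.cat([kept, context], dim=1)                   # (B, k + 1, D)
```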
This adaptive design ensures the model dynamically reallocates resources to relevant tokens, crucial for efficiency in complex multimodal reasoning and high-resolution visual inputs.
4. Performance, Efficiency, and Empirical Results
Compression efficacy is evaluated using accuracy on classification or multimodal benchmarks, FLOPs, throughput, cache/storage reduction, and performance trade-off curves:
- TPS (Wei et al., 2023): Improves DeiT-small accuracy by 1–6% at 35% of original FLOPs; 1.64× throughput gain; >4% accuracy gain on DeiT-tiny baseline under similar budget.
- DiffRate (Chen et al., 2023): Reduces FLOPs by 40% and accelerates inference by 1.5× with only a 0.16% accuracy drop on ViT-H; learns layer-wise compression rates for each block.
- VoCo-LLaMA (Ye et al., 18 Jun 2024): Achieves a 576× token reduction, 94.8% fewer FLOPs, 69.6% inference speedup, with 83.7% retention of full input visual performance.
- TokenCarve (Tan et al., 13 Mar 2025): Reduces the token count to 22.2% of the original (576 → 128 tokens), with a 1.23× speedup, a 64% reduction in KV cache, and only a 1.54% accuracy drop.
A table summarizing representative results:
| Method | Reduction | Throughput Gain | Accuracy Impact |
|---|---|---|---|
| TPS (Wei et al., 2023) | ~65% FLOPs | 1.03–1.64× | +1% to +6% |
| DiffRate (Chen et al., 2023) | 40% FLOPs | 1.5× | –0.16% |
| VoCo-LLaMA (Ye et al., 18 Jun 2024) | 576× tokens | 69.6% faster inference | 83.7% retention |
| TokenCarve (Tan et al., 13 Mar 2025) | 4.5× tokens (576 → 128) | 1.23× | –1.54% |
These data confirm both consistent speedup and minimal or even positive performance impact for well-designed compression schemes at aggressive token reduction ratios.
5. Engineering, Deployment, and Resource Considerations
State-of-the-art token compression is hardware-friendly and readily adaptable:
- Matrix Multiplication Formulations: Approaches such as Prune and Merge (Mao et al., 30 Mar 2025) and Token Transforming (Zeng et al., 6 Jun 2025) are built upon sparse or block assignment matrices, leveraging efficient, accelerator-friendly operations for real-time deployment.
- Training-Free and Plug-and-Play: Several frameworks (TokenCarve, Token Transforming, LaCo) are training-free and universally applicable, requiring no fine-tuning or retraining on specific architectures.
- Layer-wise and Residual Architecture: Compressing tokens progressively within the encoder, as opposed to post-encoding, yields smoother representation flow and more faithful information preservation (Liu et al., 3 Jul 2025, Lu et al., 27 Mar 2025); a pixel-shuffle sketch is given below.
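As referenced in the last item, a minimal sketch of pixel-shuffle style in-encoder compression with a non-parametric residual; the 2× stride, the linear projection, and the channel-averaged residual loosely follow the LaCo description and are illustrative assumptions rather than its exact implementation:

```python
import torch
import torch.nn as nn

class PixelShuffleCompress(nn.Module):
    """Merge each stride x stride neighbourhood of patch tokens into one token by
    space-to-channel reshaping, project back to the model width, and add a
    non-parametric residual from channel averaging (LaCo-flavoured sketch)."""

    def __init__(self, dim, stride=2):
        super().__init__()
        self.stride = stride
        self.proj = nn.Linear(dim * stride * stride, dim)

    def forward(self, tokens, grid_h, grid_w):
        # tokens: (B, grid_h * grid_w, D) patch tokens on a grid;
        # grid_h and grid_w are assumed divisible by the stride.
        B, N, D = tokens.shape
        s = self.stride
        x = tokens.view(B, grid_h, grid_w, D)
        # space-to-channel: gather each s x s window into the channel axis
        x = x.view(B, grid_h // s, s, grid_w // s, s, D)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, grid_h // s, grid_w // s, s * s * D)
        merged = self.proj(x)                                   # learned fusion of each window
        residual = x.view(B, grid_h // s, grid_w // s, s * s, D).mean(dim=3)  # channel average
        out = merged + residual                                 # residual preserves information flow
        return out.reshape(B, (grid_h // s) * (grid_w // s), D)  # 4x fewer tokens for s=2
```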
These designs enable deployment on edge devices, multi-view 3D and real-time applications (Zhang et al., 1 Sep 2024), and context-extending transformer systems for document, video, and audio reasoning (Xing et al., 2 Feb 2025).
6. Limitations, Challenges, and Future Research
Key challenges include:
- Sensitivity in Compact ViTs: Token compression plug-ins designed for large models can severely degrade performance in structurally compact transformers due to misalignment of token distributions and pretrained weights, necessitating full retraining or tailored adaptation (Nguyen et al., 13 Jul 2025).
- Information Loss under Extreme Compression: Even information-preserving frameworks risk performance drop under very high compression, especially for dense prediction or fine-grained recognition tasks.
- Complexity of Compression Policy: Task- and context-aware or differentiable policies complicate the design and integration of selection and merging modules, particularly for joint image–video or multimodal use cases (Yang et al., 12 Dec 2024, Ye et al., 18 Jun 2024).
Active research directions involve developing unified, architecture-adaptive token compression frameworks, expanding end-to-end differentiable (gradient-based) assignment for all vision backbones, and creating adaptive, content-aware, and cross-modal compression policies that balance efficiency, flexibility, and robustness.
7. Summary and Outlook
Vision token compression has emerged as a crucial technology for efficient scaling of transformer-based models in computer vision and multimodal AI. State-of-the-art methods combine rigorous information/attention-driven scoring, flexible assignment via matrix transformation, and adaptive, often instruction- or task-aware retention to deliver order-of-magnitude acceleration with minimal or no loss in representation quality. The field is moving toward training-free, plug-and-play, and unified frameworks capable of sustaining both global semantic context and local detail, supporting widespread practical deployment and enabling further architectural innovation. As the breadth and complexity of vision–language tasks increase, token compression will play a foundational role in the next wave of scalable, high-performance, and resource-aware models in AI research and applications.