UniPruneBench: Visual Token Compression
- UniPruneBench is a unified benchmark for evaluating visual token compression in large multimodal models, addressing redundancy and computational overhead.
- It standardizes evaluation protocols across diverse model families, algorithms, and tasks by reporting both task-specific and system-level performance metrics.
- Empirical results reveal that even simple methods like random pruning yield competitive performance, highlighting trade-offs between efficiency gains and accuracy drops.
UniPruneBench is a unified and extensible benchmark for evaluating visual token compression strategies in large multimodal models (LMMs). Developed to address the inefficiency stemming from the high redundancy of visual tokens in modern multimodal architectures, UniPruneBench introduces rigorously standardized protocols, covers a broad set of model families, algorithms, and tasks, and reports both task-specific and system-level performance metrics. Its empirical findings provide foundational insights into the design and evaluation of efficient multimodal inference systems (Peng et al., 4 Nov 2025).
1. Motivation and Overview
LMMs, widely adopted for tasks such as visual question answering (VQA), multimodal reasoning, and grounding, process images by converting them into sequences of visual tokens—for example, using CLIP or CoCa-style Vision Transformers (ViTs) that produce hundreds of patch embeddings. Unlike textual tokens, these visual tokens are highly redundant: removing a substantial fraction yields only small accuracy losses for many downstream tasks. However, such redundancy results in significant computational overhead, including quadratic scaling of the attention mechanism, elevated memory consumption, and increased prefilling (context encoding) latency, which are particularly problematic for real-time or large-scale deployment.
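To make the quadratic term concrete, the following back-of-the-envelope sketch (assumed hidden size, layer count, and text length; these figures are not from the benchmark) estimates the attention cost of a prefill pass at several visual-token budgets:

```python
# Back-of-the-envelope sketch (illustrative, not from the paper) of how the
# self-attention cost of prefill grows quadratically with sequence length.
# Hidden size, layer count, and text length below are assumed values.

def attention_flops(num_tokens: int, hidden_dim: int = 4096, num_layers: int = 32) -> int:
    """Approximate FLOPs of the QK^T and attention-weighted V products during prefill."""
    per_layer = 2 * 2 * num_tokens ** 2 * hidden_dim  # two n x n x d matmuls, 2 FLOPs per MAC
    return per_layer * num_layers

text_tokens = 128
for visual_tokens in (576, 192, 64):  # full CLIP-ViT budget vs. two pruned budgets
    n = text_tokens + visual_tokens
    print(f"{visual_tokens:4d} visual tokens -> ~{attention_flops(n) / 1e12:.2f} TFLOPs of attention")
```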
Previous efforts to compress visual inputs via plug-and-play methods—such as pruning low-attention tokens or merging similar visual patches—have offered promising reductions in redundancy. However, these studies have been fragmented, lacking uniform task coverage, evaluation protocols, and consistent inclusion of system-level metrics such as latency and memory usage. UniPruneBench was introduced as a solution by providing a cohesive, reproducible framework for evaluating visual token compression within LMMs, enabling direct and fair comparison across algorithms, models, and tasks.
2. Benchmark Structure and Evaluation Protocol
Formal Definition and Metrics
Given an image $I$, a frozen vision encoder (typically CLIP-ViT) generates a set of $N$ visual tokens. A compression algorithm selects or merges tokens to output $M \le N$ tokens, defining the pruning ratio as:

$$\rho = 1 - \frac{M}{N}$$
System-level metrics include:
- Total inference time $T_{\text{total}}$ (seconds/batch)
- Prefill time $T_{\text{prefill}}$ (seconds/batch): the time required to encode vision and text before the first decoding step
- Method time $T_{\text{method}}$ (seconds/batch): overhead for importance scoring, selection, and token re-layout
Speedup relative to the uncompressed (vanilla) model is defined as:

$$\text{Speedup} = \frac{T_{\text{vanilla}}}{T_{\text{pruned}}}$$

computed separately for total inference time and prefill time.
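A tiny sketch making the two definitions concrete (helper names are illustrative; the example values are the 576-token CLIP budget and the total-time figures reported in Section 5):

```python
def pruning_ratio(n_original: int, n_kept: int) -> float:
    """Pruning ratio rho = 1 - M/N: the fraction of visual tokens removed."""
    return 1.0 - n_kept / n_original

def speedup(t_vanilla: float, t_pruned: float) -> float:
    """Speedup = T_vanilla / T_pruned, computed for total or prefill time."""
    return t_vanilla / t_pruned

print(pruning_ratio(576, 64))   # ~0.889 (the 11.1% retention budget)
print(speedup(761.0, 454.0))    # ~1.68, the total-time speedup of G-Prune in Section 5
```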
Task Dimensions and Dataset Coverage
UniPruneBench evaluates six critical multimodal abilities across ten publicly available datasets, using standardized prompts and normalization:
| Ability Dimension | Representative Datasets |
|---|---|
| Comprehensive Understanding | MME, MMBench |
| Mathematical Reasoning | MathVista, Math-Vision |
| Optical Character Recognition | SEED-Bench-2-Plus, OCRBench |
| Instruction Following | MIA-Bench |
| Multidisciplinary Knowledge | ScienceQA |
| Hallucination | POPE, HallusionBench |
Tasks cover open-ended QA, multiple-choice reasoning, free-form captioning, and OCR/text reading.
Evaluation Protocol
Reproducibility is established by fixing prompt templates, tokenization, the random seed for stochastic algorithms (e.g., random pruning), and the pruning layer (K = 2 for intra-LLM methods), and by normalizing all accuracy metrics to a [0, 100] scale. Pruning budgets are consistently reported at retention rates of 33.3%, 22.2%, and 11.1% (corresponding to pruning ratios $\rho \approx 0.667$, $0.778$, and $0.889$), in addition to unpruned and lightly pruned baselines.
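A plausible rendering of this fixed protocol as a configuration object (field names and the concrete seed are illustrative, not the benchmark's actual schema):

```python
# Illustrative configuration capturing the fixed protocol above; field names
# and the concrete seed are assumptions, not UniPruneBench's actual schema.
EVAL_PROTOCOL = {
    "seed": 0,                                  # any fixed seed; needed for random pruning
    "intra_llm_prune_layer": 2,                 # K = 2 for intra-LLM methods
    "retention_rates": [0.333, 0.222, 0.111],   # reported pruning budgets
    "accuracy_range": (0, 100),                 # all task metrics normalized to [0, 100]
    "fixed_prompt_templates": True,
    "fixed_tokenization": True,
}

for r in EVAL_PROTOCOL["retention_rates"]:
    print(f"retention {r:.1%} -> pruning ratio {1 - r:.1%}")
```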
3. Compression Algorithms and Model Families
Algorithm Taxonomy
UniPruneBench integrates ten plug-and-play token compression methods, categorized by where pruning occurs:
- ViT-Only (Vision-Side):
- DivPrune: Diversity-maximizing subset selection.
- G-Prune: Graph-propagation-based token importance.
- LLaVA-PruMerge: Adaptive merging of similar patches in the CLIP encoder.
- LLM-Only (Language-Side):
- FastV: Removes low-attention tokens after LLM layer 2.
- VTW: Discards tokens when attention saturates at deep layers.
- FitPrune: Minimizes divergence in attention distributions.
- DART: Selects pivots and removes redundant tokens.
- Hybrid (Vision + Language):
- SparseVLM: Rank-based adaptive sparsification with token recycling.
- MustDrop: Stage-specific importance scores for encoding, prefill, decode.
- Baseline: Uniform Random Pruning (both pre-LLM and intra-LLM).
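Because uniform random pruning is the reference baseline throughout, a minimal sketch of its pre-LLM variant is given below (tensor shapes and function names are assumptions, not the benchmark's code):

```python
import torch

def random_prune(visual_tokens: torch.Tensor, retention: float, generator=None) -> torch.Tensor:
    """Uniform random pruning baseline (pre-LLM variant): keep a random subset
    of visual tokens while preserving their original order. Shapes assumed to
    be (batch, num_tokens, hidden_dim); a sketch, not the benchmark code."""
    b, n, d = visual_tokens.shape
    k = max(1, int(round(n * retention)))
    keep = torch.rand(b, n, generator=generator).argsort(dim=1)[:, :k]  # k random indices per sample
    keep, _ = keep.sort(dim=1)                                          # restore spatial order
    return torch.gather(visual_tokens, 1, keep.unsqueeze(-1).expand(-1, -1, d))

tokens = torch.randn(2, 576, 1024)
pruned = random_prune(tokens, retention=0.111)
print(pruned.shape)  # torch.Size([2, 64, 1024])
```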
Model Families
Three open-source LMM families are benchmarked across multiple parameter scales:
| Model Family | Parameter Sizes | ViT Feature Source |
|---|---|---|
| LLaVA-v1.5 | 7B | CLIP-style ViT |
| Intern-VL3 | 1B, 8B | CLIP-style ViT |
| Qwen2.5-VL | 3B, 7B | CoCa-style ViT |
All utilize a frozen ViT (≈576 tokens) and an MLP-based visual feature adapter before integration into the LLM context.
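A minimal sketch of this shared interface, with assumed hidden sizes: roughly 576 patch tokens from the frozen encoder are projected by an MLP adapter into the LLM's embedding space before being prepended to the text context.

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Minimal sketch of the MLP adapter pattern described above; hidden sizes
    are assumed for illustration and do not correspond to any specific model."""
    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(vit_tokens)

with torch.no_grad():                         # inference-only sketch
    vit_tokens = torch.randn(1, 576, 1024)    # ~576 patch tokens from the frozen encoder
    print(VisualAdapter()(vit_tokens).shape)  # torch.Size([1, 576, 4096]) -> joins the LLM context
```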
4. Task and System-Level Metrics
Task-Specific Evaluation
For each task, traditional metrics (e.g., exact match, F1, multi-choice accuracy) are normalized to a 0–100 range. UniPruneBench reports:
- Absolute accuracy $A(\rho)$ at each pruning ratio $\rho$
- Relative degradation $\Delta(\rho) = A(0) - A(\rho)$
Pruning performance is thus assessed not only by final accuracy but also by sensitivity to increasing redundancy removal.
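As a small worked example of the two reported quantities (using the LLaVA-7B averages from Section 5), the degradation is simply the vanilla accuracy minus the pruned accuracy on the normalized scale:

```python
def relative_degradation(acc_vanilla: float, acc_pruned: float) -> float:
    """Delta(rho) = A(0) - A(rho), on the normalized 0-100 accuracy scale."""
    return acc_vanilla - acc_pruned

print(relative_degradation(53.2, 52.3))  # random pruning: 0.9 points
print(relative_degradation(53.2, 44.3))  # FitPrune: 8.9 points
```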
System Metrics
Operational factors include total and prefill inference time, RAM/VRAM usage, and the method overhead $T_{\text{method}}$. All measurements use an NVIDIA A100 GPU (batch size = 1), averaged over three trials.
Empirical results show pruning methods incur negligible additional computation: method overhead constitutes less than 0.12% of total inference time $T_{\text{total}}$.
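An illustrative timing harness for these measurements (batch size 1, averaged over three trials); `step` is a hypothetical stand-in for the prefill call, the full inference pass, or the pruning method itself, and the dummy workload exists only so the sketch runs standalone.

```python
import time
import torch

def mean_time(step, trials: int = 3) -> float:
    """Average wall-clock time of `step` over several trials, syncing the GPU
    when one is present so asynchronous kernels are fully counted."""
    samples = []
    for _ in range(trials):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        step()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        samples.append(time.perf_counter() - start)
    return sum(samples) / trials

# Dummy workload; in practice T_method would be timed this way and reported
# as a fraction of T_total.
t = mean_time(lambda: torch.randn(512, 512) @ torch.randn(512, 512))
print(f"mean step time: {t * 1e3:.2f} ms")
```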
5. Empirical Results
Performance of Random Pruning
Random pruning emerges as a robust baseline. For LLaVA-7B at a fixed pruning ratio, average accuracy drops by only 0.9 points versus no pruning, matching or outperforming algorithmic methods such as FitPrune (8.9-point drop) and VTW (31.6-point drop).
| Method | Avg. Acc. | Drop vs. Vanilla |
|---|---|---|
| Vanilla | 53.2 | – |
| Random | 52.3 | 0.9 |
| FitPrune | 44.3 | 8.9 |
| VTW | 21.6 | 31.6 |
| DivPrune | 49.0 | 4.2 |
No Ubiquitous Winner
No method demonstrates superiority across all tasks, models, and pruning ratios. DivPrune performs best on Intern-VL3-8B and Qwen2.5-VL under severe compression, while hybrid methods (e.g., SparseVLM, MustDrop) are optimal for moderate pruning on LLaVA.
Task Sensitivity to Pruning
Task robustness varies sharply:
- Instruction-following tasks (MIA-Bench) are notably resilient, sometimes improving with pruning due to reduced visual distractions.
- OCR benchmarks (SEED-Bench-2-Plus, OCRBench) are highly sensitive; removing >90% of tokens severely degrades performance due to loss of spatial resolution.
Accuracy–Efficiency Trade-Off
The pruning ratio $\rho$ is the dominant determinant of performance. Light pruning (small $\rho$) results in an average accuracy loss under 10%, while aggressive pruning (large $\rho$) causes drops of 20–40 points across benchmarks.
System Speedup
On Intern-VL3-8B evaluated on MME at a fixed pruning ratio:
| Method | $T_{\text{total}}$ (s) | $T_{\text{prefill}}$ (s) | Prefill Speedup | Total Speedup |
|---|---|---|---|---|
| Vanilla | 761.0 | 320.0 | 1.00 | 1.00 |
| DivPrune | 469.0 | 185.0 | 1.73 | 1.62 |
| G-Prune | 454.0 | 167.0 | 1.92 | 1.68 |
Speedups scale roughly in proportion to the pruning ratio, with negligible method overhead.
6. Interpretations and Future Directions
UniPruneBench's standardized protocol permits identification of systemic phenomena and best practices in visual token compression:
- Random and simple methods constitute strong baselines, challenging the incremental gains promised by complex, off-the-shelf heuristics.
- There is no universally optimal strategy; both model scale and task type drastically influence pruning efficacy.
- Task-aware strategies, especially those that preserve spatial density for OCR or adjust modality balance for instruction following, are especially promising.
Possible future directions include:
- Integrating pruning with quantization in joint pipelines.
- Development of learned, end-to-end adaptive token reducers fine-tuned for downstream task requirements.
- Real-time, per-image adaptive determination of pruning ratio .
- Extension to broader modalities, such as video, 3D point clouds, and retrieval.
This suggests that future progress in efficient multimodal modeling will hinge on standardized, extensible evaluation frameworks such as UniPruneBench, coupled with deeper, scale-aware analysis of compression methods. Open-source release of code, scripts, and implementations accompanies the benchmark, facilitating reproducibility and rapid advancement in the field.