UniPruneBench: Visual Token Compression
- UniPruneBench is a unified benchmark for evaluating visual token compression in large multimodal models, addressing redundancy and computational overhead.
- It standardizes evaluation protocols across diverse model families, algorithms, and tasks by reporting both task-specific and system-level performance metrics.
- Empirical results reveal that even simple methods like random pruning yield competitive performance, highlighting trade-offs between efficiency gains and accuracy drops.
UniPruneBench is a unified and extensible benchmark for evaluating visual token compression strategies in large multimodal models (LMMs). Developed to address the inefficiency stemming from the high redundancy of visual tokens in modern multimodal architectures, UniPruneBench introduces rigorously standardized protocols, covers a broad set of model families, algorithms, and tasks, and reports both task-specific and system-level performance metrics. Its empirical findings provide foundational insights into the design and evaluation of efficient multimodal inference systems (Peng et al., 4 Nov 2025).
1. Motivation and Overview
LMMs, widely adopted for tasks such as visual question answering (VQA), multimodal reasoning, and grounding, process images by converting them into sequences of visual tokens—for example, using CLIP or CoCa-style Vision Transformers (ViTs) that produce hundreds of patch embeddings. Unlike textual tokens, these visual tokens are highly redundant: removing a substantial fraction yields only small accuracy losses for many downstream tasks. However, such redundancy results in significant computational overhead, including quadratic scaling of the attention mechanism, elevated memory consumption, and increased prefilling (context encoding) latency, which are particularly problematic for real-time or large-scale deployment.
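To make the quadratic term concrete, the following back-of-the-envelope sketch (assumed hidden size, layer count, and text length; these figures are not from the benchmark) estimates the attention cost of a prefill pass at several visual-token budgets:

```python
# Back-of-the-envelope sketch (illustrative, not from the paper) of how the
# self-attention cost of prefill grows quadratically with sequence length.
# Hidden size, layer count, and text length below are assumed values.

def attention_flops(num_tokens: int, hidden_dim: int = 4096, num_layers: int = 32) -> int:
    """Approximate FLOPs of the QK^T and attention-weighted V products during prefill."""
    per_layer = 2 * 2 * num_tokens ** 2 * hidden_dim  # two n x n x d matmuls, 2 FLOPs per MAC
    return per_layer * num_layers

text_tokens = 128
for visual_tokens in (576, 192, 64):  # full CLIP-ViT budget vs. two pruned budgets
    n = text_tokens + visual_tokens
    print(f"{visual_tokens:4d} visual tokens -> ~{attention_flops(n) / 1e12:.2f} TFLOPs of attention")
```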
Previous efforts to compress visual inputs via plug-and-play methods—such as pruning low-attention tokens or merging similar visual patches—have offered promising reductions in redundancy. However, these studies have been fragmented, lacking uniform task coverage, evaluation protocols, and consistent inclusion of system-level metrics such as latency and memory usage. UniPruneBench was introduced as a solution by providing a cohesive, reproducible framework for evaluating visual token compression within LMMs, enabling direct and fair comparison across algorithms, models, and tasks.
2. Benchmark Structure and Evaluation Protocol
Formal Definition and Metrics
Given an image $I$, a frozen vision encoder (typically CLIP-ViT) generates a set of $N$ visual tokens. A compression algorithm selects or merges tokens to output $M \le N$ tokens, defining the pruning ratio as:

$$\rho = 1 - \frac{M}{N}$$
System-level metrics include:
- Total inference time $T_{\text{total}}$ (seconds/batch)
- Prefill time $T_{\text{prefill}}$ (seconds/batch): the time required to encode vision and text before the first decoding step
- Method time $T_{\text{method}}$ (seconds/batch): overhead for importance scoring, selection, and token re-layout
Speedup relative to the uncompressed (vanilla) model is defined as:

$$\text{Speedup} = \frac{T_{\text{vanilla}}}{T_{\text{pruned}}}$$

computed separately for total inference time and prefill time.
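A tiny sketch making the two definitions concrete (helper names are illustrative; the example values are the 576-token CLIP budget and the total-time figures reported in Section 5):

```python
def pruning_ratio(n_original: int, n_kept: int) -> float:
    """Pruning ratio rho = 1 - M/N: the fraction of visual tokens removed."""
    return 1.0 - n_kept / n_original

def speedup(t_vanilla: float, t_pruned: float) -> float:
    """Speedup = T_vanilla / T_pruned, computed for total or prefill time."""
    return t_vanilla / t_pruned

print(pruning_ratio(576, 64))   # ~0.889 (the 11.1% retention budget)
print(speedup(761.0, 454.0))    # ~1.68, the total-time speedup of G-Prune in Section 5
```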
Task Dimensions and Dataset Coverage
UniPruneBench evaluates six critical multimodal abilities across ten publicly available datasets, using standardized prompts and normalization:
| Ability Dimension | Representative Datasets |
|---|---|
| Comprehensive Understanding | MME, MMBench |
| Mathematical Reasoning | MathVista, Math-Vision |
| Optical Character Recognition | SEED-Bench-2-Plus, OCRBench |
| Instruction Following | MIA-Bench |
| Multidisciplinary Knowledge | ScienceQA |
| Hallucination | POPE, HallusionBench |
Tasks cover open-ended QA, multiple-choice reasoning, free-form captioning, and OCR/text reading.
Evaluation Protocol
Reproducibility is established by fixing prompt templates, tokenization, the random seed for stochastic algorithms (e.g., random pruning), and the pruning layer (K = 2 for intra-LLM methods), and by normalizing all accuracy metrics to a [0, 100] scale. Pruning budgets are consistently reported at retention rates of 33.3%, 22.2%, and 11.1% (corresponding to pruning ratios $\rho \approx 0.667$, $0.778$, and $0.889$), in addition to unpruned and lightly pruned baselines.
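A plausible rendering of this fixed protocol as a configuration object (field names and the concrete seed are illustrative, not the benchmark's actual schema):

```python
# Illustrative configuration capturing the fixed protocol above; field names
# and the concrete seed are assumptions, not UniPruneBench's actual schema.
EVAL_PROTOCOL = {
    "seed": 0,                                  # any fixed seed; needed for random pruning
    "intra_llm_prune_layer": 2,                 # K = 2 for intra-LLM methods
    "retention_rates": [0.333, 0.222, 0.111],   # reported pruning budgets
    "accuracy_range": (0, 100),                 # all task metrics normalized to [0, 100]
    "fixed_prompt_templates": True,
    "fixed_tokenization": True,
}

for r in EVAL_PROTOCOL["retention_rates"]:
    print(f"retention {r:.1%} -> pruning ratio {1 - r:.1%}")
```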
3. Compression Algorithms and Model Families
Algorithm Taxonomy
UniPruneBench integrates ten plug-and-play token compression methods, categorized by where pruning occurs:
- ViT-Only (Vision-Side):
- DivPrune: Diversity-maximizing subset selection.
- G-Prune: Graph-propagation-based token importance.
- LLaVA-PruMerge: Adaptive merging of similar patches in the CLIP encoder.
- LLM-Only (Language-Side):
- FastV: Removes low-attention tokens after LLM layer 2.
- VTW: Discards tokens when attention saturates at deep layers.
- FitPrune: Minimizes divergence in attention distributions.
- DART: Selects pivots and removes redundant tokens.
- Hybrid (Vision + Language):
- SparseVLM: Rank-based adaptive sparsification with token recycling.
- MustDrop: Stage-specific importance scores for encoding, prefill, decode.
- Baseline: Uniform Random Pruning (both pre-LLM and intra-LLM).
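Because uniform random pruning is the reference baseline throughout, a minimal sketch of its pre-LLM variant is given below (tensor shapes and function names are assumptions, not the benchmark's code):

```python
import torch

def random_prune(visual_tokens: torch.Tensor, retention: float, generator=None) -> torch.Tensor:
    """Uniform random pruning baseline (pre-LLM variant): keep a random subset
    of visual tokens while preserving their original order. Shapes assumed to
    be (batch, num_tokens, hidden_dim); a sketch, not the benchmark code."""
    b, n, d = visual_tokens.shape
    k = max(1, int(round(n * retention)))
    keep = torch.rand(b, n, generator=generator).argsort(dim=1)[:, :k]  # k random indices per sample
    keep, _ = keep.sort(dim=1)                                          # restore spatial order
    return torch.gather(visual_tokens, 1, keep.unsqueeze(-1).expand(-1, -1, d))

tokens = torch.randn(2, 576, 1024)
pruned = random_prune(tokens, retention=0.111)
print(pruned.shape)  # torch.Size([2, 64, 1024])
```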
Model Families
Three open-source LMM families are benchmarked across multiple parameter scales:
| Model Family | Parameter Sizes | ViT Feature Source |
|---|---|---|
| LLaVA-v1.5 | 7B | CLIP-style ViT |
| Intern-VL3 | 1B, 8B | CLIP-style ViT |
| Qwen2.5-VL | 3B, 7B | CoCa-style ViT |
All utilize a frozen ViT (≈576 tokens) and an MLP-based visual feature adapter before integration into the LLM context.
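A minimal sketch of this shared interface, with assumed hidden sizes: roughly 576 patch tokens from the frozen encoder are projected by an MLP adapter into the LLM's embedding space before being prepended to the text context.

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Minimal sketch of the MLP adapter pattern described above; hidden sizes
    are assumed for illustration and do not correspond to any specific model."""
    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(vit_tokens)

with torch.no_grad():                         # inference-only sketch
    vit_tokens = torch.randn(1, 576, 1024)    # ~576 patch tokens from the frozen encoder
    print(VisualAdapter()(vit_tokens).shape)  # torch.Size([1, 576, 4096]) -> joins the LLM context
```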
4. Task and System-Level Metrics
Task-Specific Evaluation
For each task, traditional metrics (e.g., exact match, F1, multi-choice accuracy) are normalized to a 0–100 range. UniPruneBench reports:
- Absolute accuracy $A(\rho)$ at each pruning ratio $\rho$
- Relative degradation $\Delta(\rho) = A(0) - A(\rho)$
Pruning performance is thus assessed not only by final accuracy but also by sensitivity to increasing redundancy removal.
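As a small worked example of the two reported quantities (using the LLaVA-7B averages from Section 5), the degradation is simply the vanilla accuracy minus the pruned accuracy on the normalized scale:

```python
def relative_degradation(acc_vanilla: float, acc_pruned: float) -> float:
    """Delta(rho) = A(0) - A(rho), on the normalized 0-100 accuracy scale."""
    return acc_vanilla - acc_pruned

print(relative_degradation(53.2, 52.3))  # random pruning: 0.9 points
print(relative_degradation(53.2, 44.3))  # FitPrune: 8.9 points
```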
System Metrics
Operational factors include total and prefill inference time, RAM/VRAM usage, and the method overhead $T_{\text{method}}$. All measurements use an NVIDIA A100 GPU (batch size = 1), averaged over three trials.
Empirical results show pruning methods incur negligible additional computation: method overhead constitutes less than 0.12% of total inference time $T_{\text{total}}$.
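An illustrative timing harness for these measurements (batch size 1, averaged over three trials); `step` is a hypothetical stand-in for the prefill call, the full inference pass, or the pruning method itself, and the dummy workload exists only so the sketch runs standalone.

```python
import time
import torch

def mean_time(step, trials: int = 3) -> float:
    """Average wall-clock time of `step` over several trials, syncing the GPU
    when one is present so asynchronous kernels are fully counted."""
    samples = []
    for _ in range(trials):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        step()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        samples.append(time.perf_counter() - start)
    return sum(samples) / trials

# Dummy workload; in practice T_method would be timed this way and reported
# as a fraction of T_total.
t = mean_time(lambda: torch.randn(512, 512) @ torch.randn(512, 512))
print(f"mean step time: {t * 1e3:.2f} ms")
```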
5. Empirical Results
Performance of Random Pruning
Random pruning emerges as a robust baseline. For LLaVA-7B at a fixed pruning ratio, average accuracy drops by only 0.9 points versus no pruning, matching or outperforming algorithmic methods such as FitPrune (8.9-point drop) and VTW (31.6-point drop).
| Method | Avg. Acc. | Drop vs. Vanilla |
|---|---|---|
| Vanilla | 53.2 | – |
| Random | 52.3 | 0.9 |
| FitPrune | 44.3 | 8.9 |
| VTW | 21.6 | 31.6 |
| DivPrune | 49.0 | 4.2 |
No Ubiquitous Winner
No method demonstrates superiority across all tasks, models, and pruning ratios. DivPrune performs best on Intern-VL3-8B and Qwen2.5-VL under severe compression, while hybrid methods (e.g., SparseVLM, MustDrop) are optimal for moderate pruning on LLaVA.
Task Sensitivity to Pruning
Task robustness varies sharply:
- Instruction-following tasks (MIA-Bench) are notably resilient, sometimes improving with pruning due to reduced visual distractions.
- OCR benchmarks (SEED-Bench-2-Plus, OCRBench) are highly sensitive; removing >90% of tokens severely degrades performance due to loss of spatial resolution.
Accuracy–Efficiency Trade-Off
The pruning ratio $\rho$ is the dominant determinant of performance. Light pruning (small $\rho$) results in an average accuracy loss under 10%, while aggressive pruning (large $\rho$) causes drops of 20–40 points across benchmarks.
System Speedup
On Intern-VL3-8B evaluated on MME at a fixed pruning ratio:
| Method | $T_{\text{total}}$ (s) | $T_{\text{prefill}}$ (s) | Prefill Speedup | Total Speedup |
|---|---|---|---|---|
| Vanilla | 761.0 | 320.0 | 1.00 | 1.00 |
| DivPrune | 469.0 | 185.0 | 1.73 | 1.62 |
| G-Prune | 454.0 | 167.0 | 1.92 | 1.68 |
Speedups scale roughly in proportion to the pruning ratio, with negligible method overhead.
6. Interpretations and Future Directions
UniPruneBench's standardized protocol permits identification of systemic phenomena and best practices in visual token compression:
- Random and simple methods constitute strong baselines, challenging the incremental gains promised by complex, off-the-shelf heuristics.
- There is no universally optimal strategy; both model scale and task type drastically influence pruning efficacy.
- Task-aware strategies, especially those that preserve spatial density for OCR or adjust modality balance for instruction following, are especially promising.
Possible future directions include:
- Integrating pruning with quantization in joint pipelines.
- Development of learned, end-to-end adaptive token reducers fine-tuned for downstream task requirements.
- Real-time, per-image adaptive determination of pruning ratio .
- Extension to broader modalities, such as video, 3D point clouds, and retrieval.
This suggests that future progress in efficient multimodal modeling will hinge on standardized, extensible evaluation frameworks such as UniPruneBench, coupled with deeper, scale-aware analysis of compression methods. Open-source release of code, scripts, and implementations accompanies the benchmark, facilitating reproducibility and rapid advancement in the field.