UniPruneBench: Visual Token Compression

Updated 10 November 2025
  • UniPruneBench is a unified benchmark for evaluating visual token compression in large multimodal models, addressing redundancy and computational overhead.
  • It standardizes evaluation protocols across diverse model families, algorithms, and tasks by reporting both task-specific and system-level performance metrics.
  • Empirical results reveal that even simple methods like random pruning yield competitive performance, highlighting trade-offs between efficiency gains and accuracy drops.

UniPruneBench is a unified and extensible benchmark for evaluating visual token compression strategies in large multimodal models (LMMs). Developed to address the inefficiency stemming from the high redundancy of visual tokens in modern multimodal architectures, UniPruneBench introduces rigorously standardized protocols, covers a broad set of model families, algorithms, and tasks, and reports both task-specific and system-level performance metrics. Its empirical findings provide foundational insights into the design and evaluation of efficient multimodal inference systems (Peng et al., 4 Nov 2025).

1. Motivation and Overview

LMMs, widely adopted for tasks such as visual question answering (VQA), multimodal reasoning, and grounding, process images by converting them into sequences of visual tokens—for example, using CLIP or CoCa-style Vision Transformers (ViTs) that produce hundreds of patch embeddings. Unlike textual tokens, these visual tokens are highly redundant: removing a substantial fraction yields only small accuracy losses for many downstream tasks. However, such redundancy results in significant computational overhead, including quadratic scaling of the attention mechanism, elevated memory consumption, and increased prefilling (context encoding) latency, which are particularly problematic for real-time or large-scale deployment.

Previous efforts to compress visual inputs via plug-and-play methods—such as pruning low-attention tokens or merging similar visual patches—have offered promising reductions in redundancy. However, these studies have been fragmented, lacking uniform task coverage, evaluation protocols, and consistent inclusion of system-level metrics such as latency and memory usage. UniPruneBench was introduced as a solution by providing a cohesive, reproducible framework for evaluating visual token compression within LMMs, enabling direct and fair comparison across algorithms, models, and tasks.

2. Benchmark Structure and Evaluation Protocol

Formal Definition and Metrics

Given an image $x$, a frozen vision encoder (typically CLIP-ViT) generates a set of $N_{\text{total}}$ visual tokens. A compression algorithm selects or merges tokens to output $N_{\text{retained}}$ tokens, defining the pruning ratio as:

$$r = 1 - \frac{N_{\text{retained}}}{N_{\text{total}}}, \qquad 0 \leq r < 1$$
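
As a concrete illustration, the following minimal Python sketch (function names are illustrative, not from the benchmark's codebase) relates pruning ratios to token counts for the ≈576-token encoders used by the benchmarked models:

```python
def pruning_ratio(n_total: int, n_retained: int) -> float:
    """Fraction of visual tokens removed: r = 1 - N_retained / N_total."""
    assert 0 < n_retained <= n_total
    return 1.0 - n_retained / n_total

def retained_tokens(n_total: int, r: float) -> int:
    """Number of tokens kept at pruning ratio r (0 <= r < 1)."""
    assert 0.0 <= r < 1.0
    return round(n_total * (1.0 - r))

# Example: the ~576-token CLIP-ViT grid used by the benchmarked models.
n_total = 576
for r in (0.667, 0.778, 0.889):
    print(f"r = {r:.1%} -> keep {retained_tokens(n_total, r)} of {n_total} tokens")
```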

System-level metrics include:

  • Total inference time $T_{\text{total}}$ (seconds/batch)
  • Prefill time $T_{\text{prefill}}$ (seconds/batch): the time required to encode vision and text before the first decoding step
  • Method time $T_{\text{method}}$ (seconds/batch): overhead for importance scoring, selection, and token re-layout

Speedup relative to the uncompressed (vanilla) model is defined as:

$$\text{Speedup}_{\text{prefill}} = \frac{T_{\text{prefill}}^{\text{vanilla}}}{T_{\text{prefill}}^{\text{pruned}}}, \qquad \text{Speedup}_{\text{total}} = \frac{T_{\text{total}}^{\text{vanilla}}}{T_{\text{total}}^{\text{pruned}}}$$
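
For concreteness, a small Python sketch of these ratios (using the Intern-VL3-8B DivPrune timings reported in Section 5) might look as follows:

```python
def speedup(t_vanilla: float, t_pruned: float) -> float:
    """Speedup of the pruned run relative to the uncompressed (vanilla) run."""
    return t_vanilla / t_pruned

# Example: Intern-VL3-8B on MME at r = 88.9%, DivPrune vs. vanilla (see Section 5).
print(f"prefill speedup: {speedup(320.0, 185.0):.2f}x")  # ~1.73x
print(f"total speedup:   {speedup(761.0, 469.0):.2f}x")  # ~1.62x
```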

Task Dimensions and Dataset Coverage

UniPruneBench evaluates six critical multimodal abilities across ten publicly available datasets, using standardized prompts and normalization:

| Ability Dimension | Representative Datasets |
| --- | --- |
| Comprehensive Understanding | MME, MMBench |
| Mathematical Reasoning | MathVista, Math-Vision |
| Optical Character Recognition | SEED-Bench-2-Plus, OCRBench |
| Instruction Following | MIA-Bench |
| Multidisciplinary Knowledge | ScienceQA |
| Hallucination | POPE, HallusionBench |

Tasks cover open-ended QA, multiple-choice reasoning, free-form captioning, and OCR/text reading.

Evaluation Protocol

Reproducibility is established by fixing prompt templates, tokenization, random seeds for stochastic algorithms (e.g., random pruning), and the pruning layer (K = 2 for intra-LLM methods), and by normalizing all accuracy metrics to a [0, 100] scale. Pruning budgets are consistently reported at retention rates of 33.3%, 22.2%, and 11.1% (corresponding to $r = 66.7\%$, $77.8\%$, and $88.9\%$), in addition to unpruned and lightly pruned baselines.
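
A hypothetical configuration sketch capturing these fixed protocol parameters is shown below; the field names and the seed value are illustrative assumptions, not the benchmark's actual schema:

```python
EVAL_PROTOCOL = {
    "seed": 42,                     # illustrative; the benchmark fixes seeds for stochastic methods
    "pruning_layer": 2,             # K = 2 for intra-LLM methods
    "accuracy_scale": (0, 100),     # all task metrics normalized to [0, 100]
    "retention_rates": (1.0, 0.333, 0.222, 0.111),  # unpruned plus the three reported budgets
}

def pruning_ratios(cfg: dict) -> list[float]:
    """Convert retention rates into pruning ratios r = 1 - retention."""
    return [round(1.0 - keep, 3) for keep in cfg["retention_rates"]]

print(pruning_ratios(EVAL_PROTOCOL))  # [0.0, 0.667, 0.778, 0.889]
```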

3. Compression Algorithms and Model Families

Algorithm Taxonomy

UniPruneBench integrates ten plug-and-play token compression methods, categorized by where pruning occurs:

  • ViT-Only (Vision-Side):
    • DivPrune: Diversity-maximizing subset selection.
    • G-Prune: Graph-propagation-based token importance.
    • LLaVA-PruMerge: Adaptive merging of similar patches in the CLIP encoder.
  • LLM-Only (Language-Side):
    • FastV: Removes low-attention tokens after LMM layer 2.
    • VTW: Discards tokens when attention saturates at deep layers.
    • FitPrune: Minimizes divergence in attention distributions.
    • DART: Selects pivots and removes redundant tokens.
  • Hybrid (Vision + Language):
    • SparseVLM: Rank-based adaptive sparsification with token recycling.
    • MustDrop: Stage-specific importance scores for encoding, prefill, decode.
  • Baseline: Uniform Random Pruning (both pre-LLM and intra-LLM).
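
The uniform random pruning baseline is simple enough to sketch directly. Below is a minimal pre-LLM variant in Python/PyTorch; the tensor names and shapes are assumptions for illustration, not the benchmark's implementation:

```python
import torch

def random_prune(visual_tokens: torch.Tensor, r: float) -> torch.Tensor:
    """Uniformly drop a fraction r of visual tokens before they enter the LLM.

    visual_tokens: (batch, n_total, hidden) patch embeddings from the frozen ViT/adapter.
    Returns a (batch, n_retained, hidden) tensor with original token order preserved.
    """
    batch, n_total, hidden = visual_tokens.shape
    n_retained = max(1, round(n_total * (1.0 - r)))
    # Sample a random permutation per batch element, keep the first n_retained indices,
    # and sort them so positional order (and positional-encoding alignment) is preserved.
    perm = torch.rand(batch, n_total, device=visual_tokens.device).argsort(dim=1)
    keep = perm[:, :n_retained].sort(dim=1).values
    return torch.gather(visual_tokens, 1, keep.unsqueeze(-1).expand(-1, -1, hidden))

# Example: prune 66.7% of a 576-token grid.
tokens = torch.randn(1, 576, 1024)
print(random_prune(tokens, 0.667).shape)  # torch.Size([1, 192, 1024])
```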

Model Families

Three open-source LMM families are benchmarked at the parameter scales listed below:

| Model Family | Parameter Sizes | ViT Feature Source |
| --- | --- | --- |
| LLaVA-v1.5 | 7B | CLIP-style ViT |
| Intern-VL3 | 1B, 8B | CLIP-style ViT |
| Qwen2.5-VL | 3B, 7B | CoCa-style ViT |

All utilize a frozen ViT (≈576 tokens) and an MLP-based visual feature adapter before integration into the LLM context.

4. Task and System-Level Metrics

Task-Specific Evaluation

For each task, traditional metrics (e.g., exact match, F1, multi-choice accuracy) are normalized to a 0–100 range. UniPruneBench reports:

  • Absolute accuracy $A(r)$ at pruning ratio $r$
  • Relative degradation $\Delta A(r) = A(0) - A(r)$

Pruning performance is thus assessed not only in terms of final accuracy but also in terms of its sensitivity to increasingly aggressive redundancy removal.
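
As a small worked example (using the LLaVA-7B averages reported in Section 5), the degradation metric reduces to a simple difference on the normalized scale:

```python
def relative_degradation(acc_unpruned: float, acc_pruned: float) -> float:
    """Delta A(r) = A(0) - A(r), in points on the normalized 0-100 scale."""
    return acc_unpruned - acc_pruned

# LLaVA-7B averages at r = 66.7% (see Section 5): random pruning loses only 0.9 points.
print(relative_degradation(53.2, 52.3))  # 0.9
```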

System Metrics

Operational factors include total and prefill inference time, RAM/VRAM usage, and the method overhead $T_{\text{method}}$. All measurements use an NVIDIA A100 GPU (batch size = 1), averaged over three trials.

Empirical results show pruning methods incur negligible additional computation: method overhead $T_{\text{method}} < 0.5$ s constitutes less than 0.12% of $T_{\text{total}}$.
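
A minimal timing harness in the spirit of these measurements might look as follows; the `generate_fn` interface is an assumption for illustration, and this is not the benchmark's actual measurement code:

```python
import time
import torch

def time_generation(generate_fn, n_trials: int = 3) -> float:
    """Average wall-clock seconds per batch over n_trials runs.

    generate_fn is assumed to run one full batch (prefill + decoding); GPU work is
    synchronized around each run so the timer captures it completely.
    """
    timings = []
    for _ in range(n_trials):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        generate_fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```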

5. Empirical Results

Performance of Random Pruning

Random pruning emerges as a robust baseline. For LLaVA-7B at $r = 66.7\%$, average accuracy drops by only 0.9 points versus no pruning, matching or outperforming algorithmic methods such as FitPrune (−8.9 points) and VTW (−31.6 points).

| Method | Avg. Acc. | Drop vs. Vanilla |
| --- | --- | --- |
| Vanilla | 53.2 | 0.0 |
| Random | 52.3 | 0.9 |
| FitPrune | 44.3 | 8.9 |
| VTW | 21.6 | 31.6 |
| DivPrune | 49.0 | 4.2 |

No Ubiquitous Winner

No method demonstrates superiority across all tasks, models, and pruning ratios. DivPrune performs best on Intern-8B and Qwen under severe compression, while hybrid methods (e.g., SparseVLM, MustDrop) are optimal for moderate pruning on LLaVA.

Task Sensitivity to Pruning

Task robustness varies sharply:

  • Instruction-following tasks (MIA-Bench) are notably resilient, sometimes improving with pruning due to reduced visual distractions.
  • OCR benchmarks (SEED-Bench-2-Plus, OCRBench) are highly sensitive; removing >90% of tokens severely degrades performance due to loss of spatial resolution.

Accuracy–Efficiency Trade-Off

The pruning ratio $r$ is the dominant determinant of performance. Light pruning ($r = 66.7\%$) results in an average accuracy loss under 10%. Aggressive pruning ($r = 88.9\%$) causes drops of 20–40 points across benchmarks.

System Speedup

On Intern-VL3-8B with $r = 88.9\%$ (MME):

| Method | $T_{\text{total}}$ (s) | $T_{\text{prefill}}$ (s) | $\text{Speedup}_{\text{prefill}}$ | $\text{Speedup}_{\text{total}}$ |
| --- | --- | --- | --- | --- |
| Vanilla | 761.0 | 320.0 | 1.00 | 1.00 |
| DivPrune | 469.0 | 185.0 | 1.73 | 1.62 |
| G-Prune | 454.0 | 167.0 | 1.92 | 1.68 |

Speedups scale roughly in proportion to the pruning ratio, with negligible method overhead.

6. Interpretations and Future Directions

UniPruneBench's standardized protocol permits identification of systemic phenomena and best practices in visual token compression:

  • Random and simple methods constitute strong baselines, challenging the incremental gains promised by complex, off-the-shelf heuristics.
  • There is no universally optimal strategy; both model scale and task type drastically influence pruning efficacy.
  • Task-aware strategies, especially those that preserve spatial density for OCR or adjust modality balance for instruction following, are especially promising.

Possible future directions include:

  • Integrating pruning with quantization in joint pipelines.
  • Development of learned, end-to-end adaptive token reducers fine-tuned for downstream task requirements.
  • Real-time, per-image adaptive determination of pruning ratio rr.
  • Extension to broader modalities, such as video, 3D point clouds, and retrieval.

This suggests that future progress in efficient multimodal modeling will hinge on standardized, extensible evaluation frameworks such as UniPruneBench, coupled with deeper, scale-aware analysis of compression methods. Open-source release of code, scripts, and implementations accompanies the benchmark, facilitating reproducibility and rapid advancement in the field.
