- The paper presents an adaptive two-stage token pruning approach using image entropy and dynamic layer selection to efficiently reduce redundant visual tokens.
- It demonstrates high accuracy retention (up to 89.8%) at aggressive pruning ratios, significantly lowering latency and memory usage across various VLM architectures.
- The method is hardware-agnostic and robust to architectural changes, ensuring practical deployment in real-world, high-resolution inference scenarios.
ERASE: Adaptive Two-Stage Vision Token Pruning for Efficient VLM Inference
Motivation and Background
The latest generation of Vision-LLMs (VLMs) integrates advanced LLMs with high-resolution visual input processing, yielding substantial gains in multimodal tasks. However, this integration is accompanied by quadratic growth in the number of visual tokens as image resolution increases, which sharply raises computational overhead for high-resolution images. Because the cost of attention in LLMs depends on input sequence length, longer visual sequences increase both inference latency and memory usage.
Existing token pruning methods are either single-objective (targeting vision-only redundancy or text-conditioned relevance) or hybrid. Most suffer from one or more of the following limitations:
- Reliance on semantic features from internal latent representations, limiting device interoperability and failing to exploit raw-image redundancy.
- Use of fixed, static token pruning ratios or locations, unable to adapt to varied image complexity.
- High memory overheads from storing or computing attention maps, especially for high-resolution inputs.
These limitations motivate the need for a framework that adaptively compresses vision tokens in a hardware- and architecture-agnostic manner, balancing aggressive compression with preservation of functional accuracy.
ERASE Framework Overview
ERASE (Eliminating Redundant visual tokens via Adaptive two-StagE token pruning) introduces an adaptive, two-stage pipeline, aimed at maximizing token pruning while preserving multimodal model efficacy.
- Stage 1 - Image-Level Entropy-Based Pruning:
Visual redundancy is directly estimated from the raw image by computing the entropy of each image patch. Low-entropy patches, which indicate redundant or homogeneous regions, are aggressively pruned. By quantifying global image complexity as the median local patch entropy, ERASE adapts the Stage 1 pruning ratio using Bayesian-optimized thresholds for discrete complexity levels. This approach removes visually redundant regions before visual tokens even enter the VLM backbone, eliminating both computation and memory for these tokens in downstream layers.
- Stage 2 - Context-Aware Dynamic Layer Pruning:
To further condense the token set, surviving visual tokens undergo text-conditioned relevance estimation via cross-attention with the input instruction. Uniquely, ERASE selects the pruning layer dynamically: simple scenes (low-entropy) are pruned at early decoder layers, while complex images invoke mid-to-late layer pruning. This strategy tailors token retention to the specific requirements of multimodal reasoning depth per input, established empirically by analyzing accuracy drops from pruning at various layers as a function of image entropy.
The combination of these stages achieves strong compression of vision tokens with minimal semantic loss, regardless of image complexity or model architecture.
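The Stage 1 procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes an 8-bit grayscale image, non-overlapping square patches, and Shannon entropy of each patch's intensity histogram. The function names, the 28-pixel patch size, the complexity boundaries in `level_edges`, and the per-level ratios in `prune_ratios` are hypothetical placeholders; in ERASE the corresponding thresholds are obtained through Bayesian optimization.

```python
import numpy as np


def patch_entropies(gray, patch=28):
    """Shannon entropy (in bits) of the intensity histogram of each patch.

    `gray` is assumed to be a 2-D uint8 array. Rows and columns that do not
    fill a whole patch are ignored.
    """
    rows, cols = gray.shape[0] // patch, gray.shape[1] // patch
    ent = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            block = gray[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
            counts = np.bincount(block.ravel(), minlength=256)
            p = counts[counts > 0] / block.size  # non-zero bin probabilities
            ent[r, c] = -np.sum(p * np.log2(p))
    return ent


def stage1_keep_mask(gray, patch=28, level_edges=(2.0, 4.0),
                     prune_ratios=(0.7, 0.5, 0.3)):
    """Return (keep_mask, complexity, level) for one image.

    complexity : median patch entropy, used as the global image complexity.
    level      : discrete complexity level (0 = simple ... 2 = complex).
    keep_mask  : boolean grid, False for the lowest-entropy patches.
    `level_edges` and `prune_ratios` are hypothetical values.
    """
    ent = patch_entropies(gray, patch)
    complexity = float(np.median(ent))
    level = int(np.searchsorted(np.asarray(level_edges), complexity))
    n_prune = int(prune_ratios[level] * ent.size)

    # Prune the n_prune patches with the lowest entropy.
    order = np.argsort(ent, axis=None, kind="stable")
    keep = np.ones(ent.size, dtype=bool)
    keep[order[:n_prune]] = False
    return keep.reshape(ent.shape), complexity, level
```

Mapping the returned mask onto visual tokens assumes one token per patch in raster order, which depends on the vision encoder's tokenization and tiling scheme.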
Experimental Results and Quantitative Analysis
ERASE is evaluated under extreme compression scenarios across recent VLMs (Qwen2.5-VL-7B, Qwen3-VL-8B, and InternVL3-8B) and multiple challenging benchmarks, especially those demanding fine-grained visual reasoning.
Accuracy Retention:
At an 85% token pruning ratio, ERASE retains approximately 89.5% of the original model accuracy across complex benchmarks, compared to 78.2% for the best prior method. On Qwen3-VL-8B, accuracy retention is similarly high: 89.8% with ERASE versus 80.4% for the strongest baseline. Importantly, on document and text-centric tasks (OCRBench, TextVQA, InfoVQA), ERASE consistently shows lower accuracy degradation than previous methods, which supports the value of adaptive redundancy removal in both its vision-only and context-aware components.
Efficiency Gains:
In high-resolution (4K) settings with ~16k vision tokens, ERASE reduces end-to-end task latency by up to 1.56× versus the base model. It also achieves the largest reduction in KV cache size (6.57×) among all comparison baselines. Unlike prior methods, which either defer pruning to deep layers (losing efficiency) or rely on heavy attention map computations, ERASE's lightweight entropy-based first stage and retrospective KV cache pruning provide real-world latency and memory advantages.
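To make the relationship between pruning ratio and KV cache size concrete, the short Python sketch below does the arithmetic under simplifying assumptions. It assumes that pruned visual tokens are removed from the cache of every layer (one reading of retrospective KV cache pruning), that text tokens are never pruned, and that the cache stores one key and one value vector per token, per layer, per KV head. The function names and all numbers in the usage lines (layer count, KV heads, head dimension, text length) are hypothetical and are not taken from the paper, so the result is not expected to reproduce the reported 6.57× figure.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Size of a KV cache in bytes; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def kv_reduction(num_visual, num_text, prune_ratio):
    """Ratio of unpruned to pruned sequence length (equals the cache ratio
    when every layer's cache drops the pruned visual tokens)."""
    kept_visual = num_visual - round(prune_ratio * num_visual)
    return (num_visual + num_text) / (kept_visual + num_text)


# Hypothetical configuration: 16,000 visual tokens, 100 text tokens, 85% pruning.
before = kv_cache_bytes(16_000 + 100, num_layers=28, num_kv_heads=4, head_dim=128)
after = kv_cache_bytes(2_400 + 100, num_layers=28, num_kv_heads=4, head_dim=128)
print(before / after, kv_reduction(16_000, 100, 0.85))
```

Because text tokens are retained, the achievable reduction is always somewhat below the vision-only bound of 1 / (1 - prune_ratio).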
Robustness and Generality:
Ablations demonstrate that ERASE's performance is robust to architectural changes and model scales. Bayesian-optimized threshold configurations for one architecture transfer effectively to similar models, obviating expensive configuration search for every deployment. Additionally, ERASE maintains high performance even with aggressive early-stage pruning, a regime under which alternatives like CDPruner, PruneSID, and iLLaVA incur catastrophic accuracy loss.
Methodological Contributions
ERASE's methodological innovations include:
- Entropy-Guided, Input-Adaptive Pruning: Directly measures and employs raw image entropy, enabling redundancy removal uninfluenced by encoder semantics or attention distributions, and optimally adapts to image heterogeneity.
- Dynamic Pruning Layer Selection: Links global image complexity to the hierarchical depth required for satisfactory multimodal reasoning, improving the efficiency-accuracy trade-off by pruning tokens at the earliest possible layer consistent with task requirements.
- Architecture-Agnostic Design: Removes dependencies on global [CLS] tokens (as needed in earlier CLIP-based methods) or proprietary encoder semantics, allowing application to VLMs employing diverse visual tokenization and tiling mechanisms.
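The dynamic layer selection and text-conditioned scoring summarized in this list can be sketched in Python as follows. This is an illustrative sketch rather than the paper's code: it assumes that relevance is the mean attention weight each visual token receives from the instruction tokens at the chosen layer, and that a fixed fraction of the highest-scoring tokens is kept. The function names, the complexity boundaries in `level_edges`, the layer indices in `layers`, and the `keep_ratio` parameter are hypothetical; the paper determines the entropy-to-layer mapping empirically.

```python
import math

import numpy as np


def select_pruning_layer(complexity, level_edges=(2.0, 4.0), layers=(4, 12, 20)):
    """Map global image complexity (median patch entropy) to a decoder layer.

    Low-complexity images are pruned early, high-complexity images later.
    Both `level_edges` and `layers` are hypothetical values.
    """
    level = int(np.searchsorted(np.asarray(level_edges), complexity))
    return layers[level]


def text_conditioned_keep(attn, keep_ratio):
    """Indices of the visual tokens to keep, in their original order.

    attn : array of shape (num_text_tokens, num_visual_tokens) holding the
           attention weights from instruction tokens to visual tokens at the
           selected layer (assumed already averaged over heads).
    """
    scores = attn.mean(axis=0)  # relevance of each visual token
    k = max(1, math.ceil(keep_ratio * scores.size))
    top = np.argpartition(-scores, k - 1)[:k]  # k highest-scoring tokens
    return np.sort(top)


# Usage with made-up attention weights for 2 text tokens and 4 visual tokens.
attn = np.array([[0.10, 0.60, 0.05, 0.25],
                 [0.20, 0.50, 0.10, 0.20]])
layer = select_pruning_layer(complexity=3.0)
kept = text_conditioned_keep(attn, keep_ratio=0.5)
```

Returning sorted indices preserves the positional order of the surviving tokens, which matters when the decoder relies on position-dependent encodings.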
Theoretical and Practical Implications
On the theoretical front, ERASE establishes a formal link between low-level information-theoretic measures (entropy) and optimal pruning strategies for complex multimodal models. It empirically validates that early, adaptive removal based on raw input statistics can outperform sophisticated internal feature-based redundancy estimators. Furthermore, by decoupling visual and context-aware token preservation, it highlights the importance of multimodal token dynamics and optimal propagation depth for varied downstream task intricacy.
From a practical perspective, ERASE provides immediate utility in production VLM deployments, especially in latency- or memory-constrained scenarios and in high-throughput inference. Its hardware-agnostic design also facilitates broad adoption across heterogeneous accelerator architectures.
Future Directions
Potential avenues for future research, motivated by ERASE's findings, include:
- Integrating content-aware spatial attention in image patch selection, further refining entropy estimates in structurally non-uniform domains.
- Extending adaptive token pruning to streaming and video settings, where temporal redundancy estimation could further reduce multimodal sequence length.
- Joint training or fine-tuning of pruning modules end-to-end with VLMs to maximize semantic preservation, rather than relying on post-hoc Bayesian configuration.
- Investigating cross-modal entropy propagation for earlier identification of text-grounded but visually subtle tokens.
Conclusion
ERASE presents a significant advancement in vision token pruning for VLMs through its adaptive, entropy-driven, two-stage mechanism for redundancy elimination and context-aware token retention. It achieves higher accuracy retention and real-world efficiency gains than the prior baselines it is compared against, with strong robustness across VLM architectures and task domains. These contributions make ERASE a compelling approach for building practical, scalable, and broadly deployable vision-language systems (2605.09982).