LLaVA-PruMerge: Efficient Visual Token Merging
- The paper demonstrates that LLaVA-PruMerge condenses 576 visual tokens into as few as 32, reducing FLOPs and prefill time with minimal accuracy loss.
- It utilizes attention-based metrics and clustering to identify and merge redundant tokens while maintaining spatial coherence.
- Implications include 2–6× speedups and 40–50% memory reduction, offering a scalable, efficient solution for LLM-augmented vision systems.
LLaVA-PruMerge is an adaptive visual token reduction module designed to improve the efficiency of large multimodal models (LMMs) by minimizing the computational and memory demands associated with visual tokens, while maintaining competitive performance on visual-language reasoning tasks. The technique is tailored for LLM-augmented vision architectures such as LLaVA ("Large Language and Vision Assistant"), seamlessly integrates with the CLIP-ViT pipeline, and provides a principled approach based on attention sparsity and clustering-informed merging (Shang et al., 2024, Hu et al., 2024).
1. Architectural Context and Integration
In LLaVA-class models, images are encoded using a frozen CLIP-ViT visual encoder that produces a sequence of patch tokens (typically 576 for standard inputs) and a [CLS] token. These embeddings are projected into the embedding space of a frozen LLM via a lightweight MLP. The visual token sequence serves as a prefix for the autoregressive LLM response.
LLaVA-PruMerge is inserted directly after the vision encoder and prior to the embedding projector for the LLM. Rather than passing all patch tokens, the module prunes a large fraction and strategically merges redundant tokens with the remaining informative ones. The output is a condensed prefix of tokens, effectively compressing the visual context without discarding critical spatial information (Shang et al., 2024).
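The placement described above can be sketched as a minimal pipeline. All function names here are illustrative stand-ins, not the actual LLaVA or CLIP API; the point is only where the reduction module sits between the frozen encoder and the projector.

```python
import numpy as np

def encode_image(image):
    # Stand-in for a frozen CLIP-ViT encoder: 576 patch tokens plus a
    # [CLS] token, each of (assumed) dimension 1024.
    rng = np.random.default_rng(0)
    patch_tokens = rng.normal(size=(576, 1024))
    cls_token = rng.normal(size=(1024,))
    return patch_tokens, cls_token

def prumerge(patch_tokens, cls_token):
    # Stand-in for the token reduction module: here we simply keep the
    # first 32 tokens to show the interface, not the real selection rule.
    return patch_tokens[:32]

def project_to_llm(tokens, out_dim=4096):
    # Stand-in for the lightweight MLP projector into the LLM embedding space.
    w = np.zeros((tokens.shape[1], out_dim))
    return tokens @ w

patch_tokens, cls_token = encode_image(None)
reduced = prumerge(patch_tokens, cls_token)
prefix = project_to_llm(reduced)
print(prefix.shape)  # (32, 4096)
```

The condensed `prefix` is what the LLM consumes in place of the full 576-token sequence.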
2. Quantification of Visual Token Informativeness
PruMerge quantifies token importance using attention-based metrics. Specifically, it examines the dot-product attention between the [CLS] query $q_{\text{cls}}$ of the penultimate ViT layer and each patch key $k_i$:

$$a_i = \mathrm{softmax}\!\left(\frac{q_{\text{cls}} \cdot k_i}{\sqrt{d}}\right),$$

where $d$ is the key dimension and the softmax is taken over all patches.
Empirical sparsity is observed: most patch-to-[CLS] attention scores are near zero, with salient tokens exhibiting outlier values. Informative (unpruned) tokens are identified by applying the Interquartile Range (IQR) rule to select outliers:
- $Q_1$ denotes the 25th percentile of the scores $\{a_i\}$ and $Q_3$ the 75th percentile,
- $\mathrm{IQR} = Q_3 - Q_1$,
- the retained index set contains the patch indices $i$ whose scores fall outside $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$.
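The scoring-and-selection step can be sketched in NumPy. This is a minimal illustration under assumed shapes (576 patch keys of dimension 64), not the paper's implementation; the boosted first five keys merely simulate salient patches so that the IQR rule has outliers to find.

```python
import numpy as np

def select_informative_tokens(q_cls, keys, d):
    # Attention of the [CLS] query over every patch key.
    logits = keys @ q_cls / np.sqrt(d)
    a = np.exp(logits - logits.max())
    a = a / a.sum()                      # softmax over patches
    # IQR outlier rule: keep patches whose attention falls outside
    # [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = np.percentile(a, [25, 75])
    iqr = q3 - q1
    keep = (a < q1 - 1.5 * iqr) | (a > q3 + 1.5 * iqr)
    return np.flatnonzero(keep), a

rng = np.random.default_rng(0)
d = 64
keys = rng.normal(size=(576, d))
q_cls = rng.normal(size=(d,))
# Simulate a few salient patches strongly aligned with the query so they
# stand out as attention outliers.
keys[:5] += 5.0 * q_cls / np.linalg.norm(q_cls)
kept, scores = select_informative_tokens(q_cls, keys, d)
print(len(kept))  # adaptive: depends on the score distribution
```

Note that the number of kept tokens is not fixed in advance, which is exactly the content-adaptive behavior described above.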
This adaptivity allows dynamic token selection proportional to image content complexity, with more visually busy images retaining a larger subset of tokens (Shang et al., 2024).
3. Clustering and Token Merging Mechanism
For the tokens retained as most informative, PruMerge supplements their content by merging in similar pruned tokens. Token similarity is defined as the cosine similarity between the corresponding ViT key embeddings:

$$\mathrm{sim}(i, j) = \frac{k_i \cdot k_j}{\lVert k_i \rVert\,\lVert k_j \rVert}.$$
For each kept token index $p$, the $k$ nearest neighbors among the pruned set are identified by sorting similarity scores, denoted $\mathcal{N}_k(p)$. The merge step computes a new representation as an attention-weighted average over the cluster:

$$m_p = \frac{\sum_{j \in \{p\} \cup \mathcal{N}_k(p)} a_j\, x_j}{\sum_{j \in \{p\} \cup \mathcal{N}_k(p)} a_j},$$

where $x_j$ are the patch token embeddings and $a_j$ the [CLS]-attention scores defined above.
The final token set is fed to the LLM. The optional enhancement, PruMerge⁺, augments the selection with a small uniformly spaced sample of additional tokens prior to merging, mitigating minor accuracy drops (Shang et al., 2024). A plausible implication is that clustering-based merging maintains local spatial coherence while absorbing otherwise discarded peripheral information.
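The merge step above can be sketched as follows. This is a simplified illustration under assumed shapes, using the token embeddings themselves for the similarity computation; `k=3` neighbors per kept token is an arbitrary choice for the example.

```python
import numpy as np

def merge_tokens(tokens, scores, kept_idx, k=3):
    # For each kept token, find its k most similar pruned tokens (cosine
    # similarity) and fuse the cluster by an attention-weighted average.
    pruned_idx = np.setdiff1d(np.arange(len(tokens)), kept_idx)
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    merged = []
    for p in kept_idx:
        sim = normed[pruned_idx] @ normed[p]        # cosine similarity
        nbrs = pruned_idx[np.argsort(sim)[-k:]]     # k nearest pruned tokens
        cluster = np.concatenate(([p], nbrs))
        w = scores[cluster] / scores[cluster].sum() # attention weights
        merged.append(w @ tokens[cluster])          # weighted average m_p
    return np.stack(merged)

rng = np.random.default_rng(1)
tokens = rng.normal(size=(576, 64))
scores = rng.random(576)
scores /= scores.sum()
kept = np.array([3, 100, 250])
out = merge_tokens(tokens, scores, kept)
print(out.shape)  # (3, 64)
```

Each output row replaces one kept token, enriched with information absorbed from its pruned neighbors.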
4. Computational Complexity and Efficiency Gains
In Transformer networks, computation cost grows quadratically with sequence length. By reducing the number of visual tokens from 576 to an average of 32 (a compression ratio of roughly 18×), LLaVA-PruMerge provides substantial savings:
- FLOPs reduction is approximately 10× (9.3T → 0.91T on LLaVA-1.5)
- Empirical measurements on V100 GPU:
- LLaVA-1.5 (576 tokens): 9.3T FLOPs, 88.6 ms prefill, 23.3 GB memory
- LLaVA-1.5 + PruMerge (40 tokens): 0.91T FLOPs, 15.3 ms prefill, 13.7 GB memory
Table: Token Reduction Impact on Inference Metrics
| Configuration | Visual Tokens | Prefill Time (ms) | Peak Memory (GB) |
|---|---|---|---|
| Baseline LLaVA-1.5 | 576 | 88.6 | 23.3 |
| LLaVA-1.5 w/ PruMerge | ~40 | 15.3 | 13.7 |
Prefill speedup is roughly 5.8× (88.6 ms → 15.3 ms) and peak memory is reduced by about 41% (Shang et al., 2024). In the iLLaVA extension, parallel merging of hundreds of tokens at multiple image-encoder and LLM layers roughly halves overall inference time and peak GPU memory usage, with the cost of merging being negligible ($O(N)$ versus $O(N^2)$ for attention) (Hu et al., 2024).
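The headline ratios follow directly from the measured values in the table (V100, LLaVA-1.5), as a quick arithmetic check shows:

```python
# Derive the reported efficiency ratios from the measured values above.
flops_base, flops_pm = 9.3e12, 0.91e12
prefill_base, prefill_pm = 88.6, 15.3   # ms
mem_base, mem_pm = 23.3, 13.7           # GB

print(round(flops_base / flops_pm, 1))       # ~10.2x FLOPs reduction
print(round(prefill_base / prefill_pm, 1))   # ~5.8x prefill speedup
print(round(100 * (1 - mem_pm / mem_base)))  # ~41% memory reduction
```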
5. Empirical Performance and Trade-Offs
Compression ratios achieved by PruMerge are significant: 576→32 tokens (5.5% retained) for the full module, and 576→144 tokens (25% retained) for PruMerge⁺. Across multiple benchmarks (VQAv2, ScienceQA, TextVQA, POPE, MME, MMBench), full PruMerge incurs a minor accuracy drop (1–5 points absolute), whereas PruMerge⁺ nearly eliminates the performance gap while providing 2–3× efficiency gains.
Examples (Vicuna-7B and Vicuna-13B):
| Benchmark | LLaVA-1.5 (baseline) | + PruMerge⁺ (25% tokens) |
|---|---|---|
| VQAv2 | 78.5 | 76.8 |
| ScienceQA | 66.8 | 68.3 |
| TextVQA | 58.2 | 57.1 |
| POPE | 85.9 | 84.0 |
| MME | 1510.7 | 1462.4 |
A plausible implication is that nearly all visual reasoning capability is preserved under strong compression, with the optional uniform-sampling step (PruMerge⁺) closing the accuracy gap (Shang et al., 2024).
6. Broader Applicability and Related Methods
The principle underlying PruMerge—leveraging attention-based metrics to identify redundancy, followed by context-preserving token merging—is extensible to a variety of LVLMs. The iLLaVA method (Hu et al., 2024) demonstrates one-step weighted merging at both the image encoder and LLM levels, with empirical results showing throughput nearly doubled and memory costs halved, while the maximum accuracy loss remains small across models and tasks. The technique is compatible with models of varying size and architecture and generalizes across single-image, multi-image, and video settings.
Visualizations in both works indicate that post-merging token maps correspond closely to salient image objects, regions containing text, or other semantically relevant parts, confirming attention as a reliable redundancy indicator. Layerwise profiling shows balanced time savings across both visual and language components.
7. Concluding Perspective
LLaVA-PruMerge embodies a robust, adaptive token reduction and merging strategy for LVLMs. By concretely quantifying and exploiting attention-based sparsity in patch tokens, clustering informative regions, and recycling information from pruned tokens into merged embeddings, the method achieves 2–6× prefill/inference speedup, 40–50% reduction in GPU memory usage, and near-baseline performance on diverse benchmarks. This suggests considerable future potential for scalable multimodal reasoning in resource-constrained environments and offers a model-agnostic solution for efficient LVLM deployment (Shang et al., 2024, Hu et al., 2024).