
LLaVA-PruMerge: Efficient Visual Token Merging

Updated 20 January 2026
  • The paper demonstrates that LLaVA-PruMerge condenses 576 visual tokens into as few as 32, reducing FLOPs and prefill time with minimal accuracy loss.
  • It utilizes attention-based metrics and clustering to identify and merge redundant tokens while maintaining spatial coherence.
  • Implications include 2–6× speedups and 40–50% memory reduction, offering a scalable, efficient solution for LLM-augmented vision systems.

LLaVA-PruMerge is an adaptive visual token reduction module designed to improve the efficiency of large multimodal models (LMMs) by minimizing the computational and memory demands associated with visual tokens, while maintaining competitive performance on visual-language reasoning tasks. The technique is tailored for LLM-augmented vision architectures such as LLaVA ("Large Language and Vision Assistant"), seamlessly integrates with the CLIP-ViT pipeline, and provides a principled approach based on attention sparsity and clustering-informed merging (Shang et al., 2024, Hu et al., 2024).

1. Architectural Context and Integration

In LLaVA-class models, images are encoded using a frozen CLIP-ViT visual encoder that produces a sequence of patch tokens (typically 576 for standard inputs) and a [CLS] token. These embeddings are projected into the embedding space of a frozen LLM via a lightweight MLP. The visual token sequence serves as a prefix for the autoregressive LLM response.

LLaVA-PruMerge is inserted directly after the vision encoder and prior to the embedding projector for the LLM. Rather than passing all patch tokens, the module prunes a large fraction and strategically merges redundant tokens with the remaining informative ones. The output is a condensed prefix of $m \ll 576$ tokens, effectively compressing the visual context without discarding critical spatial information (Shang et al., 2024).
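Schematically, the integration point can be sketched as follows. This is an illustrative NumPy sketch, not the released implementation: all shapes, the random selection stand-in, and the single-matrix projector are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 576, 1024                        # patch-token count and ViT hidden size (illustrative)
patch_tokens = rng.normal(size=(n, d))  # stand-in for frozen CLIP-ViT patch embeddings

# PruMerge sits here: condense n tokens into m << n.
# Random choice is a placeholder for the adaptive selection of Sections 2-3.
keep = rng.choice(n, size=32, replace=False)
condensed = patch_tokens[keep]          # (m, d) condensed visual token set

# The condensed tokens are then projected into the LLM embedding space
# (a single linear map here stands in for the lightweight MLP projector).
d_llm = 4096                            # illustrative LLM hidden size
proj = rng.normal(size=(d, d_llm))
visual_prefix = condensed @ proj        # (m, d_llm), prepended to the text tokens
print(visual_prefix.shape)              # (32, 4096)
```

The point of the sketch is the interface: everything downstream of the module sees only $m$ visual tokens instead of 576.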

2. Quantification of Visual Token Informativeness

PruMerge quantifies token importance using attention-based metrics. Specifically, it examines the dot-product attention scores between the penultimate-layer ViT [CLS] query $q_{\text{cls}}$ and each patch key $k_i$:

$a_{\text{cls}} = \mathrm{softmax}\left(q_{\text{cls}} K^\top / \sqrt{d_k}\right) \in \mathbb{R}^{n+1}$

Empirical sparsity is observed: most patch-to-[CLS] attention scores are near zero, with salient tokens exhibiting outlier values. Informative (unpruned) tokens are identified by applying the Interquartile Range (IQR) rule to select outliers:

  • $Q_1$ = 25th percentile, $Q_3$ = 75th percentile, $\mathrm{IQR} = Q_3 - Q_1$
  • $\text{lower\_fence} = Q_1 - 1.5\,\mathrm{IQR}$, $\text{upper\_fence} = Q_3 + 1.5\,\mathrm{IQR}$
  • The index set $\{i_1, \dots, i_m\}$ contains the patch indices whose $a_{\text{cls}}$ values fall outside these fences.

This adaptivity allows dynamic token selection proportional to image content complexity, with more visually busy images retaining a larger subset of tokens (Shang et al., 2024).
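The outlier-based selection can be sketched in a few lines of NumPy. This is an illustrative sketch, not the reference code: the synthetic scores below merely imitate the sparse, heavy-tailed attention pattern described above (the real scores come from the ViT softmax), and the salient indices are chosen by hand.

```python
import numpy as np

def iqr_outlier_indices(a_cls):
    """Select patch indices whose [CLS] attention lies outside the IQR fences."""
    q1, q3 = np.percentile(a_cls, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return np.where((a_cls < lower) | (a_cls > upper))[0]

# Synthetic sparse attention: most scores near zero, a few salient outliers.
rng = np.random.default_rng(0)
a_cls = rng.uniform(0.0, 1e-3, size=576)
a_cls[[10, 99, 300]] = [0.2, 0.15, 0.3]   # hand-picked "salient" patches

kept = iqr_outlier_indices(a_cls)
print(kept)   # only the salient patches survive the IQR filter
```

Because the fences are computed per image, a visually busy image with many attention outliers automatically retains more tokens, which is exactly the adaptivity noted above.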

3. Clustering and Token Merging Mechanism

For tokens retained as most informative, PruMerge seeks to supplement their informational content by merging nearby pruned tokens. Token similarity is defined as:

$\text{Sim}(i, j) = k_i k_j^\top$

For each kept token index $p \in \{i_1, \dots, i_m\}$, the $k$ nearest neighbors among the pruned set are identified (by sorting similarity scores), denoted $\{j_1, \dots, j_k\}$. The merge step computes a new representation:

$y_p' = \sum_{q=1}^{k} a_{\text{cls}}[j_q] \cdot y_{j_q}$

The final token set $Y' = \{y_{i_1}', \dots, y_{i_m}'\}$ is fed to the LLM. The optional enhancement, PruMerge⁺, augments the selection with a small uniformly spaced sample of additional tokens prior to merging, mitigating minor accuracy drops (Shang et al., 2024). A plausible implication is that clustering-based merging maintains local spatial coherence while absorbing otherwise discarded peripheral information.
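The merge step above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: embeddings, keys, and attention scores are synthetic, the function name is illustrative, and $y_p'$ is computed exactly as in the formula above (a weighted sum over the $k$ most similar pruned tokens).

```python
import numpy as np

def merge_pruned_into_kept(tokens, keys, a_cls, kept_idx, k=3):
    """For each kept token p, absorb its k most similar pruned tokens as an
    attention-weighted sum (illustrative sketch of the PruMerge merge step)."""
    n = tokens.shape[0]
    pruned_idx = np.setdiff1d(np.arange(n), kept_idx)
    merged = []
    for p in kept_idx:
        # Sim(p, j) = k_p . k_j^T, evaluated against pruned keys only.
        sims = keys[pruned_idx] @ keys[p]
        nearest = pruned_idx[np.argsort(sims)[-k:]]   # k most similar pruned tokens
        # y_p' = sum_q a_cls[j_q] * y_{j_q}
        merged.append((a_cls[nearest][:, None] * tokens[nearest]).sum(axis=0))
    return np.stack(merged)   # condensed token set Y', shape (m, d)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))   # token embeddings y_i (illustrative d=64)
keys = rng.normal(size=(576, 64))     # ViT keys k_i
a_cls = rng.uniform(size=576)         # [CLS]-to-patch attention scores
kept = np.array([5, 42, 100])         # indices selected by the IQR rule

y_prime = merge_pruned_into_kept(tokens, keys, a_cls, kept)
print(y_prime.shape)                  # (3, 64)
```

The loop is $O(m \cdot n)$ similarity comparisons, which is negligible next to the LLM forward pass it saves.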

4. Computational Complexity and Efficiency Gains

In Transformer networks, self-attention cost grows quadratically with sequence length. By reducing the number of visual tokens from 576 to $m$ (average compression ratio $\sim 18\times$), LLaVA-PruMerge provides substantial savings:

  • FLOPs reduction is approximately $(m/576)^2$ for the attention component
  • Empirical measurements on a V100 GPU:
    • LLaVA-1.5 (576 tokens): 9.3T FLOPs, 88.6 ms prefill, 23.3 GB memory
    • LLaVA-1.5 + PruMerge ($\sim$40 tokens): 0.91T FLOPs, 15.3 ms prefill, 13.7 GB memory

Table: Token Reduction Impact on Inference Metrics

| Configuration          | Visual Tokens | Prefill Time (ms) | Peak Memory (GB) |
|------------------------|---------------|-------------------|------------------|
| Baseline LLaVA-1.5     | 576           | 88.6              | 23.3             |
| LLaVA-1.5 w/ PruMerge  | ~40           | 15.3              | 13.7             |

Prefill speedup is $\sim 6\times$ and memory consumption is reduced by $\sim 40\%$ (Shang et al., 2024). In the iLLaVA extension, parallel merging of hundreds of tokens at multiple image-encoder and LLM layers further halves the overall throughput time and peak GPU memory usage, with the cost of merging being negligible ($O(Nd)$ vs. $O(N^2 d)$ for attention) (Hu et al., 2024).
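As a back-of-the-envelope check on the figures above (a sketch, not a profiling result: the quadratic term applies only to self-attention, while MLP and projection layers scale linearly in token count, which is why the measured end-to-end ratio sits above the pure quadratic estimate):

```python
# Token counts: m condensed visual tokens vs. the n = 576 baseline.
m, n = 40, 576
attn_ratio = (m / n) ** 2    # quadratic attention-cost ratio, ~0.0048 (~200x fewer)
linear_ratio = m / n         # per-token (MLP/projection) ratio, ~0.069 (~14x fewer)

# Measured end-to-end ratios from the table above (V100, LLaVA-1.5).
flops_ratio = 0.91 / 9.3     # total FLOPs ratio, ~0.098
prefill_ratio = 15.3 / 88.6  # prefill-time ratio, ~0.17, i.e. ~6x speedup

print(f"{attn_ratio:.4f} {linear_ratio:.4f} {flops_ratio:.3f} {prefill_ratio:.3f}")
```

The measured total sits between neither bound exactly because of fixed per-step overheads (text tokens, kernel launch, memory traffic), but the ordering is as expected.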

5. Empirical Performance and Trade-Offs

Compression ratios achieved by PruMerge are significant: 576 → $\sim$32 tokens (5.5% retained) for the full module, and 576 → $\sim$144 tokens (25% retained) for PruMerge⁺. Across multiple benchmarks (VQAv2, ScienceQA, TextVQA, POPE, MME, MMBench), full PruMerge incurs a minor accuracy drop (1–5 points absolute), whereas PruMerge⁺ nearly eliminates the performance gap while providing 2–3× efficiency gains.

Examples (Vicuna-7B and Vicuna-13B):

| Benchmark | Baseline | + PruMerge⁺ (25%) |
|-----------|----------|--------------------|
| VQAv2     | 78.5     | 76.8               |
| ScienceQA | 66.8     | 68.3               |
| TextVQA   | 58.2     | 57.1               |
| POPE      | 85.9     | 84.0               |
| MME       | 1510.7   | 1462.4             |

A plausible implication is that nearly all visual reasoning capability is preserved under strong compression, with the optional uniform-sampling step (PruMerge⁺) closing the accuracy gap (Shang et al., 2024).

6. Generalization and Extensions

The principle underlying PruMerge, leveraging attention-based metrics to identify redundancy followed by context-preserving token merging, is extensible to a variety of LVLMs. The iLLaVA method (Hu et al., 2024) demonstrates one-step weighted merging at both the image encoder and LLM levels, with empirical results showing throughput nearly doubled and memory costs halved, while the maximum accuracy loss remains below 0.5% across models and tasks. The technique is compatible with models of varying size and architecture and demonstrates generalizability over single-image, multi-image, and video settings.

Visualizations in both works indicate that post-merging token maps correspond closely to salient image objects, regions containing text, or other semantically relevant parts, confirming attention as a reliable redundancy indicator. Layerwise profiling shows balanced time savings across both visual and language components.

7. Concluding Perspective

LLaVA-PruMerge embodies a robust, adaptive token reduction and merging strategy for LVLMs. By concretely quantifying and exploiting attention-based sparsity in patch tokens, clustering informative regions, and recycling information from pruned tokens into merged embeddings, the method achieves 2–6× prefill/inference speedup, 40–50% reduction in GPU memory usage, and near-baseline performance on diverse benchmarks. This suggests considerable future potential for scalable multimodal reasoning in resource-constrained environments and offers a model-agnostic solution for efficient LVLM deployment (Shang et al., 2024, Hu et al., 2024).
