
TokenCarve: Training-Free Visual Token Compression

Updated 15 February 2026
  • TokenCarve is an information-preserving framework that compresses visual tokens in multimodal LLMs without additional training.
  • It leverages SVD-based token information scores combined with attention metrics to retain critical features during compression.
  • Experimental results demonstrate minimal accuracy loss with significant inference speedup and reduced memory requirements.

TokenCarve refers to a training-free, information-preserving framework for visual token compression in multimodal LLMs (MLLMs); more generically, the term can also denote a class of methods that segment binary or neural representations into semantically or statistically meaningful tokens based solely on intrinsic or statistical properties. In modern vision-language architectures, the computational and memory costs associated with large numbers of visual tokens present a major bottleneck: the prefill stage typically produces over 500 tokens per image, which account for the majority of the self-attention workload. TokenCarve addresses this constraint by exploiting the alignment between the information rank of the attention output and downstream performance, enabling substantial compression while preserving accuracy. In related but distinct contexts such as CAN bus reverse engineering, TokenCarve-style methods describe unsupervised segmentation of raw bit streams, without explicit supervision or semantic mapping.

1. Motivation and Problem Setting

Multimodal LLMs such as LLaVA or Qwen-VL ingest both text and hundreds of visual tokens to perform tasks such as captioning or visual question answering. The quadratic scaling of self-attention and the predominance of visual tokens in the prefill stage lead to significant increases in inference latency, KV-cache memory requirements, and deployment cost. Existing training-based compressors (e.g., Dynamic-LLaVA, LLaVA-Mini) require substantial retraining for each MLLM backbone to achieve high compression ratios. Conversely, training-free approaches, which require no network modification, have historically relied on naïve selection metrics (e.g., raw attention scores), resulting in rapid accuracy degradation when more than half of the visual tokens are pruned (Tan et al., 13 Mar 2025).

TokenCarve introduces an empirically grounded, information-theoretic framework for visual token reduction, ensuring that pruned or merged tokens minimize critical information loss while requiring no additional network training. In other technical domains (e.g., automotive CAN analysis), the term may refer to fully unsupervised token segmentation without reliance on ground-truth signals (Verma et al., 2018).

2. Information Rank and Theoretical Insight

TokenCarve is based on the empirical observation that the performance of MLLMs under token compression closely mirrors the loss of "information quantity" in the attention output matrix for visual tokens. Specifically, as the number of retained visual tokens is reduced, both task accuracy $P(v)$ and the matrix rank, or singular value mass, of the attention output block $Z^L_{\text{visual}}$ decline in a coordinated way.

Formally, the last layer attention output matrix for visual tokens is isolated as:

$$Z^L_{\text{visual}} \in \mathbb{R}^{L_v \times d},$$

where $L_v$ is the visual token count and $d$ the embedding dimension. The singular value spectrum $\{\sigma_i\}$ of $Z^L_{\text{visual}}$ encodes the retained information; its rank, or equivalently its Shannon entropy or cumulative singular value mass, tracks prediction performance. A sharp drop in accuracy is observed at the inflection point where the matrix rank collapses, typically when fewer than half the tokens are retained (see Fig. 3-1, Tan et al., 13 Mar 2025).

Preserving information in this context thus amounts to retaining tokens so that the rank and functional span of the attention output matrix are maximized, directly correlating with MLLM accuracy.
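The rank-tracking idea can be sketched numerically. The function below is an illustrative NumPy sketch (not from the paper): it computes the singular value spectrum of an attention output matrix and the effective rank needed to cover a chosen fraction of the singular value mass; the threshold `tau` and the toy low-rank matrix are assumptions for the example.

```python
import numpy as np

def singular_value_mass(Z, tau=0.99):
    """Return the singular values of Z (L_v x d) and the effective rank:
    the smallest k whose top-k singular values account for a fraction
    tau of the total singular value mass."""
    s = np.linalg.svd(Z, compute_uv=False)      # singular values, descending
    mass = np.cumsum(s) / s.sum()               # cumulative singular value mass
    k = int(np.searchsorted(mass, tau) + 1)     # first index reaching tau
    return s, k

# Toy example: a matrix of true rank 8 should have a small effective rank.
rng = np.random.default_rng(0)
Z = rng.standard_normal((576, 8)) @ rng.standard_normal((8, 256))
s, k = singular_value_mass(Z)
```

Pruning tokens from $Z^L_{\text{visual}}$ while keeping this effective rank (and the corresponding singular value mass) as high as possible is exactly the objective that the selection stage below targets.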

3. TokenCarve Algorithmic Framework

TokenCarve operates as a plug-and-play, two-stage compression module positioned between early Transformer layers of an MLLM (typically between layers 2 and 3). It processes off-the-shelf model attention features, requiring neither architectural modifications nor auxiliary training. The key stages are:

Stage I: Information-Preserving Guided Selection (IPGS)

  1. SVD Decomposition: Compute the SVD of the layer-2 attention output over visual tokens:

$$Z^2_\text{visual} = U \Sigma V^T,$$

where $\Sigma$ contains the singular values $\{\sigma_i\}$.

  2. Token Information Score: Each token’s contribution is

$$C(x) = \sum_i |u_{x,i}\,\sigma_i|,$$

where $u_{x,i}$ denotes the entries of row $x$ of $U$.

  3. Attention Score: Each token’s attention score, averaged across heads:

$$S(x) = \mathrm{mean}_h\, A^2_{h,:,x},$$

where $A^2_{h,:,x}$ is the layer-2 attention that token $x$ receives in head $h$.

  4. Joint Ranking: The final ranking score per token is

$$E(x) = (1-\lambda)\,C(x) + \lambda\,S(x),$$

where $\lambda \in [0,1]$ balances the information and attention terms.

  5. Pruning: Retain the top $L_{vc}(1+\rho)$ tokens by $E(x)$, where $L_{vc}$ is the compression target and $\rho$ controls the merge slack.
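The five steps above can be sketched as follows. This is an illustrative NumPy sketch, not the reference implementation: the min-max normalization of the two scores before mixing is an assumption, and the tensor shapes are chosen only for the example.

```python
import numpy as np

def ipgs_select(Z, A, L_vc, rho=0.5, lam=0.5):
    """Stage-I IPGS sketch (illustrative).

    Z : (L_v, d)    layer-2 attention output over visual tokens
    A : (H, L, L_v) per-head attention weights received by visual tokens
    Returns indices of the top L_vc*(1+rho) tokens by the joint score E(x).
    """
    U, sigma, _ = np.linalg.svd(Z, full_matrices=False)
    C = np.abs(U * sigma).sum(axis=1)       # information score C(x) = sum_i |u_{x,i} sigma_i|
    S = A.mean(axis=(0, 1))                 # attention score S(x), mean over heads and queries
    # Min-max normalise both scores before mixing (an assumption of this sketch).
    C = (C - C.min()) / (C.max() - C.min() + 1e-9)
    S = (S - S.min()) / (S.max() - S.min() + 1e-9)
    E = (1 - lam) * C + lam * S             # joint ranking score E(x)
    keep = int(L_vc * (1 + rho))            # pruning target with merge slack rho
    return np.argsort(E)[::-1][:keep]

rng = np.random.default_rng(1)
Z = rng.standard_normal((576, 64))          # 576 visual tokens, toy width 64
A = rng.random((32, 600, 576))              # 32 heads, 600 queries
idx = ipgs_select(Z, A, L_vc=128, rho=0.5)  # keeps 192 = 128 * 1.5 tokens
```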

Stage II: Information-Guided Merging

  1. Importance Splitting: Divide the retained tokens into SetA (top half by $E$) and SetB (bottom half).
  2. Cross-Similarity: Compute the normalized feature similarity between SetB and SetA:

$$\mathrm{Simi}_{B \times A} = \mathrm{Norm}(Z^2)[\mathrm{SetB}]\;\mathrm{Norm}(Z^2)[\mathrm{SetA}]^T.$$

  3. Merging: For the top $\rho L_{vc}$ matched pairs (the surplus beyond the target), merge each SetB token into its SetA partner by averaging.
  4. Final Set: Remove the merged copies, leaving exactly $L_{vc}$ tokens.
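A simplified sketch of the merging stage follows. It is illustrative only: here SetA is taken as the top $L_{vc}$ tokens rather than an exact half split, so that averaging all of SetB into SetA lands exactly on the target count.

```python
import numpy as np

def info_guided_merge(Z_kept, E_kept, L_vc):
    """Stage-II sketch (illustrative): merge each low-score token into its
    most similar high-score token and return exactly L_vc token features.

    Z_kept : (L_vc*(1+rho), d) features of tokens kept by Stage I
    E_kept : matching joint scores E(x)
    """
    order = np.argsort(E_kept)[::-1]
    set_a, set_b = order[:L_vc], order[L_vc:]          # high / low score split
    Zn = Z_kept / np.linalg.norm(Z_kept, axis=1, keepdims=True)
    sim = Zn[set_b] @ Zn[set_a].T                      # Simi_{B x A}
    target = sim.argmax(axis=1)                        # best match in SetA
    out = Z_kept[set_a].copy()
    for b, a in zip(set_b, target):                    # merge by averaging
        out[a] = 0.5 * (out[a] + Z_kept[b])
    return out

rng = np.random.default_rng(2)
Z_kept = rng.standard_normal((192, 64))                # 192 tokens from Stage I
E_kept = rng.random(192)
merged = info_guided_merge(Z_kept, E_kept, L_vc=128)   # exactly 128 tokens
```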

Complexity is dominated by the single SVD, at $O(\min(L_v, d)\, L_v d)$, and is negligible compared to the downstream savings. No gradients or model adaptation are required.

4. Experimental Evaluation

TokenCarve was comprehensively evaluated on LLaVA-1.5-7B and 13B (CLIP-ViT+LLaMA backbones) across 11 diverse datasets. Key findings:

  • Reducing from 576 to 128 tokens (22.2% retention) yields:
    • 1.54% drop in average accuracy
    • 1.23× inference speedup (latency 0.911 s vs. 1.124 s)
    • 64% reduction in KV cache storage (down to 36% of baseline size).
  • For aggressive reduction to 64 tokens (11% retention):
    • 3% accuracy drop, 1.33× speedup, 73% KV reduction.
  • Ablation studies:
    • Attention-score only: 4.18% average accuracy drop.
    • Information-score only: 4.44%.
    • Joint IPGS (as in TokenCarve): 2.61% drop.
    • The optimal $\lambda$ in $E(x)$ lies near 0.4–0.6.
    • A merge ratio $\rho$ between 20% and 60% changes results by less than 1.5%.

Robustness is observed across different models, datasets, and merge/selection hyperparameters.

5. Best Practices and Limitations

Empirically, the principal determinant of accuracy retention is preservation of the attention output matrix’s rank or singular value spectrum. TokenCarve recommends:

  • $\lambda$ in the range 0.4–0.6 to balance information and attention.
  • Target token retention of 20–30% for significant speedup at minimal accuracy cost.
  • Merge slack $\rho \approx 0.5$.
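Collected as a configuration, the recommended operating point might read as follows (the key names are hypothetical, not an API from the paper):

```python
# Hypothetical configuration capturing the recommended settings above.
tokencarve_config = {
    "lambda": 0.5,          # info/attention mix in E(x); sweet spot 0.4-0.6
    "retention": 128 / 576, # keep roughly 20-30% of visual tokens
    "rho": 0.5,             # merge slack for Stage-II averaging
    "insert_layer": 2,      # compress between Transformer layers 2 and 3
}
```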

TokenCarve is immediately deployable across ViT+Transformer architectures and does not require fine-tuning. However, practical speedup may be capped by overheads related to sparse positional encoding calculations and Python-level inefficiencies. Future optimizations include implementing sparse-index CUDA kernels and extending techniques to video or alternative input modalities (Tan et al., 13 Mar 2025).

6. Relation to Generic TokenCarve-Style Approaches

In contrast to the information-preserving visual token compression exemplified by TokenCarve in MLLMs, a more generic "TokenCarve" denotes any method that segments sequences (e.g., CAN or other time series) into contiguous tokens based on statistical variation or mutual information, typically in an unsupervised manner. For example, in automotive CAN reverse engineering (Verma et al., 2018):

  • Unsupervised TokenCarve methods use bit-flip entropy or transition probabilities for segmentation.
  • These approaches typically cluster tokens using their temporal/statistical profiles, without semantic validation or explicit ground-truth mapping.
  • They lack explicit translation mappings, do not guarantee semantic interpretability, and do not resolve token overlap with global scoring.
  • However, they require only passive capture, unlike diagnostically guided methods.
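A minimal sketch of such a flip-rate-based segmenter is given below. It is illustrative only: it loosely follows the bit-flip statistics idea and is not the exact procedure of Verma et al.; the threshold and the toy frame layout are assumptions.

```python
import numpy as np

def segment_by_flip_rate(frames, threshold=0.2):
    """Cut token boundaries in a bit stream wherever adjacent bit
    positions have sharply different flip rates, i.e. behave
    statistically differently (unsupervised, passive-capture only).

    frames : (N, W) array of 0/1 bits, N frames of width W.
    Returns a list of (start, end) bit ranges, one per token.
    """
    flips = np.abs(np.diff(frames, axis=0)).mean(axis=0)  # per-bit flip rate
    cuts = [0]
    for i in range(1, frames.shape[1]):
        if abs(flips[i] - flips[i - 1]) > threshold:
            cuts.append(i)
    cuts.append(frames.shape[1])
    return [(cuts[j], cuts[j + 1]) for j in range(len(cuts) - 1)]

# Toy payload: a fast-changing counter-like field next to a constant field.
rng = np.random.default_rng(3)
frames = np.concatenate(
    [rng.integers(0, 2, (200, 8)),           # noisy, frequently flipping bits
     np.zeros((200, 8), dtype=int)], axis=1) # constant bits (flip rate 0)
tokens = segment_by_flip_rate(frames)        # boundary expected at bit 8
```

Because the constant field never flips while the noisy field flips roughly half the time, the flip-rate discontinuity at bit 8 produces a token boundary there, with no ground-truth labels involved.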

In this broad sense, “TokenCarve” serves as a conceptual umbrella for information-preserving, unsupervised, or semi-supervised token segmentation strategies where compression or interpretability are primary objectives.

7. Impact and Future Directions

TokenCarve demonstrates that combining SVD-based information assessment with attention-derived metrics enables effective, training-free compression of visual tokens, unlocking practical efficiency gains for inference and deployment at scale in MLLMs. Potential extensions include improved wall-clock acceleration, integration with low-level CUDA implementations for sparse attention, and adaptation to other high-dimensional modalities such as video. Generic TokenCarve-style compositionality also offers promising directions in unsupervised representation learning, though limitations remain around semantic interpretability and information retention without explicit supervision.

TokenCarve thus provides a paradigm for plug-and-play, information-theoretic token compression, with applications spanning both vision-LLMs and unsupervised data segmentation across other domains (Tan et al., 13 Mar 2025, Verma et al., 2018).
