
TokenFusion for Vision Transformers

Updated 16 April 2026
  • TokenFusion is a set of techniques that merge similar visual tokens to reduce the quadratic complexity of self-attention in Vision Transformers.
  • It implements methods like ToFu and MCTF, which use similarity measures and weighted pooling to preserve important features while decreasing token count.
  • Empirical results show that these fusion strategies significantly lower computational overhead and memory usage while maintaining or improving accuracy in both unimodal and multimodal models.

TokenFusion for Vision Transformers encompasses a collection of methodologies designed to mitigate the quadratic computational and memory overhead inherent in Vision Transformer (ViT) architectures by merging—or otherwise reducing—the number of visual tokens processed in self-attention layers. Originating from the observation that many image patches encode highly redundant information, TokenFusion strategies aim to retain salient features while substantially downsizing the token sequence, thereby enabling more efficient inference and training without eroding, and sometimes improving, downstream performance. Solutions span visual, multi-modal, and cross-architecture contexts, with recent flagship designs such as ToFu, Multi-Criteria Token Fusion (MCTF), and multimodal fusion blocks achieving notable results in both unimodal and large multimodal models (LMMs).

1. Motivation and Principles of TokenFusion

Vision Transformers divide images into patches, each projected to an embedding vector (token), and then apply multi-head self-attention. Unlike text tokens, neighboring image patches often encode visually similar regions (e.g., sky, water, background), yielding many collinear or nearly identical tokens in the feature space. The resulting redundancy drives high computational cost, as the attention mechanism requires $\mathcal{O}(T^2)$ operations for $T$ tokens per forward pass. In multi-image and multi-modal settings, this overhead can be prohibitive—e.g., six high-resolution images may yield over 15,000 tokens, quickly outstripping GPU resources during inference or training (Pippi et al., 6 Mar 2025).

TokenFusion approaches merge tokens based on various criteria of semantic or statistical similarity, aiming to:

  • Reduce redundant computation: By fusing tokens, the effective attention sequence length $K$ is much smaller than the original $M$, decreasing attention FLOPs by over 75% when the token count is halved (see the worked arithmetic after this list).
  • Preserve critical information: Unlike naive pruning (which discards tokens), careful fusion preserves average or cluster-representative signals, retaining fine detail when necessary.
  • Maintain architectural flexibility: The best designs (notably ToFu) operate independently of ViT backbone, support multi-modal inputs, and can function without retraining the underlying model (Pippi et al., 6 Mar 2025, Lee et al., 2024, Kim et al., 2023).
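
As a quick check on the first bullet, the 75% figure follows directly from the quadratic attention cost evaluated at half the sequence length:

$\dfrac{(M/2)^2}{M^2} = \dfrac{1}{4},$

i.e., halving the tokens removes roughly 75% of the pairwise attention operations.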

2. Formal Frameworks and Algorithms

Multiple TokenFusion frameworks have been proposed, differing primarily in how similarity is measured and how tokens are combined. The following table summarizes key algorithms:

| Method | Similarity Metric | Fusion Mechanism |
| --- | --- | --- |
| ToFu (Pippi et al., 6 Mar 2025) | Cosine similarity $s(v,u)=\frac{v\cdot u}{\lVert v\rVert\,\lVert u\rVert}$ | Running average per cluster, threshold-based sequential assignment |
| MCTF (Lee et al., 2024) | Multi-criteria (similarity, informativeness, size) | Gradual, bidirectional matching with weighted-sum pooling |
| ToFu-MLERP (Kim et al., 2023) | Cosine similarity, max-weight matching | Hybrid: "keep-stronger" in early blocks, norm-preserving MLERP interpolation in late blocks |

ToFu Sequential Fusion

Given $M$ visual tokens $V=[v_1,\ldots,v_M]$, ToFu processes each $v_m$ sequentially:

  1. Find $j = \operatorname{arg\,max}_k\, s(v_m, t_k)$ over the current fused set $T=[t_1,\ldots,t_K]$.
  2. If $s(v_m, t_j) \geq \tau$ (a fusion threshold), fuse $v_m$ into cluster $j$ via a running average:

$t_j \leftarrow \dfrac{n_j\, t_j + v_m}{n_j + 1}, \qquad n_j \leftarrow n_j + 1,$

where $n_j$ counts the tokens already assigned to cluster $j$.

  3. Otherwise, append $v_m$ to $T$ as a new token (Pippi et al., 6 Mar 2025).
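
The procedure above translates almost line-for-line into code. Below is a minimal PyTorch sketch; the function name, default threshold, and running-average bookkeeping are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def tofu_sequential_fusion(tokens: torch.Tensor, tau: float = 0.8) -> torch.Tensor:
    """Greedy sequential token fusion in the spirit of ToFu (Pippi et al., 2025).

    tokens: (M, D) visual tokens from a ViT encoder.
    tau:    cosine-similarity fusion threshold (default is illustrative).
    Returns a fused (K, D) sequence with K <= M.
    """
    fused = [tokens[0].clone()]   # running cluster centroids
    counts = [1]                  # number of tokens absorbed per cluster
    for v in tokens[1:]:
        centroids = torch.stack(fused)                       # (K, D)
        sims = F.cosine_similarity(v.unsqueeze(0), centroids, dim=-1)
        j = int(sims.argmax())
        if sims[j] >= tau:
            # Running-average update keeps centroid j consistent with step 2 above.
            fused[j] = (counts[j] * fused[j] + v) / (counts[j] + 1)
            counts[j] += 1
        else:
            fused.append(v.clone())                          # step 3: new cluster
            counts.append(1)
    return torch.stack(fused)
```

Restacking the centroid list on every step is wasteful but keeps the correspondence to steps 1–3 obvious; a practical version would maintain a preallocated centroid matrix.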

Multi-Criteria Token Fusion (MCTF)

MCTF (Lee et al., 2024) introduces a composite attraction score $W_{i,j}$ for each token pair, combining:

  • $W^{\mathrm{sim}}_{i,j}$ - Cosine similarity between tokens.
  • $W^{\mathrm{info}}_{i,j}$ - Inverse informativeness based on "one-step-ahead" attention (the attention each token will receive in the next layer).
  • $W^{\mathrm{size}}_{i,j}$ - Inverse product of token sizes, regularizing the fusion of many tokens into one.

Gradual bipartite matching fuses a fixed number of token pairs per layer, with weighted-sum pooling aggregating values.
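
To make the scoring concrete, here is a minimal sketch of one plausible way to combine the three criteria. The plain multiplicative combination and the precomputed informativeness input are assumptions; the paper derives informativeness from one-step-ahead attention and may weight the criteria differently.

```python
import torch
import torch.nn.functional as F

def mctf_attraction(x: torch.Tensor, info: torch.Tensor, size: torch.Tensor,
                    eps: float = 1e-6) -> torch.Tensor:
    """Composite pairwise attraction scores in the spirit of MCTF (Lee et al., 2024).

    x:    (N, D) tokens.
    info: (N,) informativeness per token (e.g., from one-step-ahead attention).
    size: (N,) count of original tokens each current token already represents.
    """
    xn = F.normalize(x, dim=-1)
    w_sim = xn @ xn.T                                     # pairwise cosine similarity
    w_info = 1.0 / (info[:, None] * info[None, :] + eps)  # favor uninformative pairs
    w_size = 1.0 / (size[:, None] * size[None, :])        # discourage oversized clusters
    return w_sim * w_info * w_size                        # (N, N) attraction matrix
```

Bipartite matching would then split the tokens into two sets, select the top-scoring pairs from this matrix, and pool each pair with a size-weighted sum.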

Hybrid Fusion and MLERP

ToFu (Kim et al., 2023) argues that pure averaging can collapse feature norms. It proposes:

  • Early layers (high nonlinearity): Pruned fusion, retaining the stronger token based on score/norm.
  • Late layers (quasi-linear): Merge by arithmetic average or MLERP, a norm-preserving generalization of spherical linear interpolation:

$\hat{v} = \dfrac{\bar{v}}{\lVert \bar{v}\rVert}\,\bar{n}, \qquad \bar{v} = \frac{1}{k}\sum_{i=1}^{k} v_i,$

where $\bar{n} = \frac{1}{k}\sum_{i=1}^{k} \lVert v_i\rVert$ is the average norm of the $k$ constituent tokens. This preserves the average norm and mitigates distributional shift.
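
A minimal PyTorch sketch of a norm-preserving merge consistent with the formula above; the helper name and epsilon guard are illustrative.

```python
import torch

def norm_preserving_merge(tokens: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Merge k tokens into one while preserving their average norm.

    tokens: (k, D). Consistent with the MLERP description above (Kim et al., 2023):
    average the tokens, then rescale the result to the mean constituent norm,
    so merged features do not shrink toward the origin as with plain averaging.
    """
    mean = tokens.mean(dim=0)                 # arithmetic average (norm shrinks)
    target_norm = tokens.norm(dim=-1).mean()  # average norm of the constituents
    return mean / (mean.norm() + eps) * target_norm
```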

3. Integration into Vision Transformers and LMMs

TokenFusion modules are inserted post-encoder but pre-LLM (for LMMs) or between transformer blocks (for unimodal ViTs):

  • ViT + ToFu: Output token sequence from the ViT encoder is fused, then the token sequence (plus optional text prompt in LMM) is input to the multimodal transformer.
  • MCTF within ViT: Gradual token fusion modules are inserted between standard transformer blocks, allowing dynamic adaptability.
  • Multimodal Fusion: In multimodal models, TokenFusion can selectively substitute uninformative tokens from one modality with projected/aggregated counterparts from another, optionally preserving positional alignment via residual addition (Wang et al., 2022).
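
A sketch of the substitution step described in the last bullet; the importance-score threshold and the exact masking are assumptions based on the description above, not the released TokenFusion code.

```python
import torch

def substitute_uninformative(x_a: torch.Tensor, x_b_proj: torch.Tensor,
                             scores_a: torch.Tensor, pos_emb: torch.Tensor,
                             thresh: float = 0.02) -> torch.Tensor:
    """Token substitution in the spirit of multimodal TokenFusion (Wang et al., 2022).

    x_a:      (N, D) tokens from modality A.
    x_b_proj: (N, D) projected, spatially aligned tokens from modality B.
    scores_a: (N,) learned importance scores for modality-A tokens.
    pos_emb:  (N, D) positional embeddings, re-added residually.
    """
    mask = (scores_a < thresh).unsqueeze(-1)   # (N, 1): uninformative A-tokens
    fused = torch.where(mask, x_b_proj, x_a)   # swap them for modality-B features
    return fused + pos_emb                     # residual positional alignment
```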

Positional embeddings for fused tokens are generally inherited from one of their constituents (e.g., the first-patch position). The entire fusion process remains parameter-free in post-hoc ToFu, ensuring compatibility with pre-trained encoders (Pippi et al., 6 Mar 2025, Kim et al., 2023).
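
Putting the pieces together, the following schematic shows where a post-hoc fusion module sits in an LMM; every module name here is a hypothetical placeholder, not a real API.

```python
import torch.nn as nn

class FusedVisionPipeline(nn.Module):
    """Schematic placement of post-hoc token fusion between a pre-trained
    vision encoder and a language model (all submodules are placeholders)."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module,
                 llm: nn.Module, fuse_fn):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.llm = llm
        self.fuse_fn = fuse_fn                 # e.g., tofu_sequential_fusion above

    def forward(self, images, text_tokens):
        visual = self.vision_encoder(images)   # (M, D) tokens, post-encoder
        visual = self.fuse_fn(visual)          # (K, D), K <= M, parameter-free
        visual = self.projector(visual)        # map into the LLM embedding space
        return self.llm(visual, text_tokens)   # fused visual tokens + text prompt
```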

4. Empirical Performance and Benchmark Results

Significant gains have been observed across multi-image reasoning, classification, and generation tasks.

  • LLaVA-Interleave (9 tasks, 2–6 images each, InternVL2-4B backbone):
    • ToFu reduces token usage to 41% with a 1.52% accuracy improvement (33.79% → 35.31%), 66% memory savings, and marginal runtime increase.
    • Outperforms random sampling and HiRED for equivalent token budgets.
  • ComPairs (average 14.6k tokens/sample):
    • On InternVL2-8B, ToFu more than doubles absolute accuracy (9.02% → 23.68%) while reducing token count by over half.
    • Across all backbones, ToFu yields 2–14 percentage point gains with halved memory.
  • MCTF (DeiT-T, r=16): 44% FLOPs reduction, +0.5% Top-1 accuracy vs. baseline (72.7% vs. 72.2%).
  • ToFu-MLERP/AVG (ViT-B): For r=12 tokens/layer, achieves 82.46% Top-1 (MLERP), surpassing ToMe (81.82%) and other state-of-the-art token-reduction techniques at equivalent or better throughput.
  • Generalization: Comparable FLOPs-accuracy gains are demonstrated across T2T-ViT, LV-ViT, and large-scale conditional generation (Stable Diffusion), underscoring that the methods are orthogonal to backbone and task.

5. Sensitivity, Ablation Studies, and Limitations

Extensive ablations reveal the following:

  • Fusion threshold $\tau$: Lower $\tau$ increases fusion (fewer tokens, more risk of losing detail), while higher $\tau$ retains more tokens. Dynamically adjusting $\tau$ with the input token count (interpolated from 0.9 down to 0.7) approaches the accuracy-token Pareto frontier (Pippi et al., 6 Mar 2025); a minimal interpolation sketch follows this list.
  • Criterion weighting (MCTF): Synergistic use of similarity, informativeness, and size regularization outperforms single-criterion fusion; bidirectional matching and weighted pooling further enhance results by ≤0.3% (Lee et al., 2024).
  • Fusion mechanism: MLERP corrects for norm shrinkage and yields small (0.03–0.5%) but consistent accuracy gains over arithmetic averaging (Kim et al., 2023).
  • Residual positional alignment (multimodal): Retaining positional embeddings during token substitution recovers 1–2% additional segmentation/detection performance (Wang et al., 2022).
  • Limitations: Sequential assignment can miss globally optimal groupings (“local” redundancy may not capture all structure). Excessive fusion risks information loss around small or high-detail objects. Some fusion designs add modest overhead in similarity computation and matching, though this overhead is negligible relative to the saved FLOPs for large token counts $M$ (Pippi et al., 6 Mar 2025, Kim et al., 2023).
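
The dynamic-threshold schedule from the first bullet fits in a few lines; the token-count breakpoints below are illustrative assumptions, since the source only specifies that $\tau$ is interpolated with token count.

```python
def dynamic_threshold(num_tokens: int, t_min: int = 1_000, t_max: int = 16_000,
                      tau_hi: float = 0.9, tau_lo: float = 0.7) -> float:
    """Interpolate the fusion threshold by input length (the 0.9 -> 0.7 schedule).

    Longer inputs get a lower threshold, i.e., more aggressive fusion;
    t_min and t_max are hypothetical breakpoints, not values from the paper.
    """
    frac = min(max((num_tokens - t_min) / (t_max - t_min), 0.0), 1.0)
    return tau_hi - frac * (tau_hi - tau_lo)
```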

6. Extensions, Applications, and Future Directions

Research continues to extend TokenFusion in several directions:

  • Hierarchical and intra-block fusion: Placing fusion modules at intermediate transformer layers or even inside ViT encoder blocks.
  • Auto-tuning: Learning per-layer or per-sample fusion rates via lightweight controllers, using attention scores or other importance signals for selection.
  • Incorporation of cross-modal/multi-modal fusion: Supplying uninformative tokens from one modality with those from another, with learned or geometric correspondence and residual alignment—a design yielding gains in semantic segmentation, image translation, and 3D+2D object detection (Wang et al., 2022).
  • Hybrid pruning-merging approaches: Dynamically interpolating between pruning and merging based on depth, leveraging approximate linearity in feature manifolds at higher layers ("ToFu hybrid") (Kim et al., 2023).
  • TokenFusion in CNN-transformer hybrids: Early, late, and per-layer fusion of CNN and transformer token streams has demonstrated improved accuracy for classification tasks, with layer-by-layer fusion providing state-of-the-art accuracy (87.77% Top-1 on ImageNet-1K) (Choi et al., 2022).

7. Comparative Summary and Impact

TokenFusion has emerged as a critical component for making Vision Transformers tractable and performant—especially in scenarios with high-resolution or multi-modal inputs. Key advantages include:

  • Memory and compute savings: A 60–70% reduction in token count translates to >70% lower memory use and, since attention cost is quadratic in sequence length, an even larger reduction in attention compute.
  • Task-agnostic adaptability: Post-hoc methods like ToFu require no retraining, and fusion blocks are compatible with arbitrary ViT variants and LMMs.
  • State-of-the-art trade-offs: Systematic benchmarking shows TokenFusion outperforms both pruning-only and averaging-only baselines in both efficiency and downstream accuracy for classification, reasoning, and generative tasks (Pippi et al., 6 Mar 2025, Lee et al., 2024, Kim et al., 2023, Choi et al., 2022, Wang et al., 2022).

Collectively, these findings underscore TokenFusion as a principled, extensible, and practically impactful paradigm for optimizing the performance of ViT-based models across contemporary computer vision pipelines.
