
Multimodal Token Fusion

Updated 15 March 2026
  • Multimodal token fusion is a suite of deep learning methods that integrates modality-specific tokens into a unified representation, reducing redundancy and computational cost.
  • It employs strategies such as spatial aggregation, dynamic token reduction, and channel fusion to optimize integration for tasks like vision-language reasoning and audio-visual modeling.
  • Recent approaches show significant efficiency gains with up to 60% token reduction and improved accuracy, while preserving critical semantic details across scales.

Multimodal token fusion is a suite of algorithmic methods in deep learning designed to integrate heterogeneous modality-specific token representations—typically sequences of embeddings from vision, language, audio, or other modalities—into a unified, information-rich embedding space suitable for downstream multimodal reasoning tasks. This domain spans spatial, channel, and dynamic routing strategies for reducing redundancy, compressing information, and maximizing inter-modal synergy while controlling computational costs. Recent advances have focused on both efficient token reduction and expressive cross-modal fusion at scale, enabling state-of-the-art results across vision-LLMs, audio-visual modeling, medical data integration, and instruction-aware sequential tasks.

1. Fundamental Principles and Motivations

The core motivation for multimodal token fusion arises from the need to efficiently and effectively utilize multiple input modalities, particularly as large multimodal models (LMMs) and vision-LLMs (VLMs) inherit significant computational burdens from both quadratic self-attention complexity and the sheer number of modality-specific tokens (e.g., image patches, word pieces). Token fusion directly addresses:

  • Spatial and semantic redundancy: Large patch-based encoders (e.g., ViT) produce many highly similar tokens. Naively processing them through a transformer incurs superfluous computation (Tang et al., 8 Jun 2025).
  • Cross-modal information integration: Effective fusion must aggregate and align the salient information of text, image, audio, or sensor data to allow for robust reasoning (Georgiou et al., 15 Apr 2025, Aladago et al., 2022).
  • Scalability: With long visual contexts (multi-image, high-resolution), token pruning or fusion becomes essential for practical inference (Pippi et al., 6 Mar 2025).

These principles manifest in several architectural and algorithmic approaches, including spatial aggregation, dynamic pruning, cross-attentional fusion, channel fusion, token distillation, and adaptive expert routing.
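The quadratic self-attention cost that motivates token reduction can be made concrete with a back-of-the-envelope count (a rough sketch: the 4·n²·d figure covers only the two attention matmuls and ignores projections and MLPs):

```python
def attention_flops(n_tokens: int, dim: int) -> int:
    # The QK^T and attention-times-V matmuls each cost ~2 * n^2 * d
    # multiply-adds, so one attention layer scales quadratically in n.
    return 4 * n_tokens ** 2 * dim

base = attention_flops(576, 1024)        # e.g. a 24x24 ViT patch grid
fused = attention_flops(576 // 4, 1024)  # after a 4x token reduction
print(base // fused)  # 16: quadratic cost falls by 4^2
```

This is why a modest 4× cut in token count translates into a 16× cut in attention compute, as reported for spatial token fusion below.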

2. Token Reduction: Spatial Fusion and Redundancy Compression

Spatial-oriented token fusion approaches target the reduction of vision token count prior to cross-modal interaction, thus mitigating the quadratic cost of transformer-based attention.

  • Spatial Token Fusion (STF) (Tang et al., 8 Jun 2025): Aggregates each k×k neighborhood of vision tokens into a single, higher-dimensional token; with k=2, E=1, this gives a 4× reduction (i.e., 25% of the tokens) without channel loss. The process combines a learned k×k convolution, a 1×1 projection, and a final alignment to the LLM embedding space. This achieves a 16× reduction in attention FLOPs and can yield slightly improved VQA accuracy (66.3% vs. a 65.5% baseline, at 1.9 TFLOPs vs. 7.6 TFLOPs).
  • Dynamic Token Fusion in ToFu (Pippi et al., 6 Mar 2025): Employs a simple post-encoder, training-free sequential fusion: each incoming vision token is tested for cosine similarity above a threshold τ with the tokens already kept. Similar tokens are averaged (weighted by their frequency), while distinctive tokens are kept. A dynamic, trial-wise τ preserves crucial details in high-token scenarios (multi-image, high-resolution). ToFu yields a ~60% reduction in vision prefix length and 66% GPU memory savings, while improving or preserving performance on challenging multi-image VQA benchmarks such as LLaVA-Interleave and ComPairs.
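The spatial aggregation step of STF can be sketched with plain array reshaping: each k×k patch neighborhood is stacked into one wider token (here the subsequent learned convolution/projection to the LLM width is omitted, so the "fusion" is pure channel stacking):

```python
import numpy as np

def spatial_token_fusion(tokens, h, w, k=2):
    """Fuse each k x k neighborhood of vision tokens into one token.

    tokens: (h*w, d) array of patch embeddings laid out on an h x w grid.
    Returns (h*w // k**2, k*k*d) fused tokens; in STF a learned k x k
    convolution and 1x1 projection would follow to reach the LLM width.
    """
    n, d = tokens.shape
    grid = tokens.reshape(h, w, d)
    # regroup axes so each (k, k, d) neighborhood is contiguous
    blocks = grid.reshape(h // k, k, w // k, k, d).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(-1, k * k * d)

tokens = np.random.randn(24 * 24, 64)
fused = spatial_token_fusion(tokens, 24, 24, k=2)
print(fused.shape)  # (144, 256): 4x fewer tokens, channels preserved by stacking
```

Note how the channel dimension grows by k² as the token count shrinks by k², matching the "reduction without channel loss" property described above.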
Approach | Token Reduction | Key Mechanism | Reported Gains
Spatial Token Fusion | 4× (25% of tokens) | Learnable spatial conv | +0.8% VQA, 1/4 FLOPs
ToFu | 2.5×–3× | Sequential similarity fusion | +2% ComPairs accuracy

Excessively aggressive downsampling or unlearned merging (e.g., random sampling) degrades performance, highlighting the need for adaptive or learnable fusion (Pippi et al., 6 Mar 2025, Tang et al., 8 Jun 2025).
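The training-free sequential fusion used by ToFu can be sketched in a few lines: each token is compared against the kept set by cosine similarity, merged by a count-weighted average when it exceeds the threshold, and appended otherwise (a simplified single-pass sketch; the dynamic, trial-wise τ is fixed here):

```python
import numpy as np

def tofu_merge(tokens, tau=0.9):
    """Training-free sequential fusion in the spirit of ToFu: tokens
    similar to an existing slot are averaged in (weighted by how many
    tokens that slot has absorbed); distinctive tokens are appended."""
    kept, counts = [tokens[0].copy()], [1]
    for t in tokens[1:]:
        K = np.stack(kept)
        sims = K @ t / (np.linalg.norm(K, axis=1) * np.linalg.norm(t) + 1e-8)
        j = int(np.argmax(sims))
        if sims[j] > tau:
            kept[j] = (kept[j] * counts[j] + t) / (counts[j] + 1)
            counts[j] += 1
        else:
            kept.append(t.copy())
            counts.append(1)
    return np.stack(kept)

# near-duplicate tokens collapse into one slot; a distinctive token survives
a = np.array([1.0, 0.0]); b = np.array([0.99, 0.01]); c = np.array([0.0, 1.0])
merged = tofu_merge(np.stack([a, b, c]), tau=0.9)
print(merged.shape)  # (2, 2)
```

The frequency weighting keeps the merged slot an unbiased mean of everything it has absorbed, rather than drifting toward the most recent token.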

3. Information Preservation: Multi-Granularity and Multi-Block Fusion

Effective token fusion must not only reduce redundancy but also preserve critical information across scales:

  • Multi-Block Token Fusion (MBTF) (Tang et al., 8 Jun 2025): Supplements reduced token sequences with multi-scale visual features by concatenating outputs from several intermediate ViT blocks. Two subsequent 1×1 convolutions reshape this high-dimensional feature tensor to the desired channel width before spatial fusion. MBTF ensures that both coarse- and fine-grained information (e.g., edges, textures, semantics) is retained after aggressive STF reduction. Ablations show that MBTF alone yields 66.6% VQA performance (vs. 66.3% for MBTF+STF at 1/4 the computation).
  • Q-transform and Q-bottleneck (FLUID) (Cuong et al., 10 Aug 2025): Learnable queries extract salient, task-relevant tokens from each modality via cross-attention, followed by adaptive gating and a bottlenecked re-distillation. This mechanism compresses an ℓ-length token set down to m core features, balancing computational efficiency with representational richness for robust expert ensembles.
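A minimal sketch of the multi-block idea: per-token features from several intermediate ViT blocks are concatenated along channels, then mapped to the target width by two 1×1 convolutions, which act per token and are therefore equivalent to linear layers (w1, w2 below are random placeholders for the learned weights):

```python
import numpy as np

def multi_block_fusion(block_feats, w1, w2):
    """Concatenate per-token features from several intermediate ViT
    blocks along the channel axis, then apply two 1x1 convolutions
    (per-token linear maps w1, w2) to reach the target channel width
    before spatial fusion."""
    x = np.concatenate(block_feats, axis=-1)   # (n_tokens, n_blocks * d)
    x = np.maximum(x @ w1, 0.0)                # first 1x1 conv + ReLU
    return x @ w2                              # second 1x1 conv to target width

rng = np.random.default_rng(0)
feats = [rng.standard_normal((196, 64)) for _ in range(3)]  # 3 ViT blocks
w1 = rng.standard_normal((3 * 64, 128))
w2 = rng.standard_normal((128, 96))
out = multi_block_fusion(feats, w1, w2)
print(out.shape)  # (196, 96)
```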

Such hierarchical or multi-block designs address the shortfall of single-scale, single-pooling token fusion, which otherwise can miss localized fine details or globally relevant relations.
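The query-based distillation behind the Q-transform can be illustrated with a single-head cross-attention sketch in which m learnable queries compress ℓ modality tokens to m core features (gating and the bottlenecked re-distillation are omitted, and all weights here are random placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_distill(tokens, queries):
    """m learnable queries cross-attend over an l-length token set; each
    output row is an attention-weighted mixture of the input tokens, so
    the sequence is compressed from l to m without dropping channels."""
    d = queries.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d))  # (m, l) weights
    return attn @ tokens                             # (m, d) core features

rng = np.random.default_rng(1)
tokens = rng.standard_normal((50, 32))   # l = 50 modality tokens
queries = rng.standard_normal((8, 32))   # m = 8 learnable queries
core = query_distill(tokens, queries)
print(core.shape)  # (8, 32)
```

Because each output is a convex combination of input tokens, the compressed set stays inside the span of the original features, unlike hard pruning.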

4. Cross-Modal Token Interactions and Attention Mechanisms

Modern multimodal token fusion increasingly exploits specialized attention-based or channel-compositional interfaces:

  • Compound Tokens (Channel Fusion) (Aladago et al., 2022): For vision-language tasks, vision tokens query text tokens via linear cross-attention, and vice versa. The attended representation is concatenated at the channel level with the original query, producing aligned compound tokens. Subsequent multimodal encoding uses standard self-attention, yielding state-of-the-art performance (e.g., VQA2.0 57.5%, SNLI-VE 81.49%, GQA 80.45%).
  • Token-wise Cross-Attention and Self-Attention Fusion: Models such as CLMLF (Li et al., 2022) and SFusion (Liu et al., 2022) stack self-attention layers or cross-modality fusion transformers over concatenated token sequences, capturing deep cross-modal relations. Further, TACOformer (Li, 2023) integrates token-wise and channel-wise cross-modal attention, explicitly modeling both sequence and embedding-dimension dependencies, leading to improved emotion recognition.
  • Pixel-wise Local Fusion (GeminiFusion) (Jia et al., 2024): GeminiFusion applies 2-token self+cross attention at each spatial location, balancing intra- and inter-modal cues with a layer-adaptive noise to prevent domination by self-inputs, achieving both linear computational scaling and SOTA fusion quality (outperforming TokenFusion by 2.6–3.4% mIoU and full cross-attention at a fraction of the cost).
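The channel-fusion interface of Compound Tokens can be sketched as cross-attention followed by channel concatenation (this sketch uses standard softmax attention rather than the linear cross-attention of the paper, and shows only the vision-queries-text direction; the symmetric direction works the same way):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compound_tokens(vision, text):
    """Vision tokens query text tokens via cross-attention; the attended
    result is concatenated channel-wise with the query itself, producing
    aligned compound tokens of doubled embedding width."""
    d = vision.shape[-1]
    attended = softmax(vision @ text.T / np.sqrt(d)) @ text  # (n_v, d)
    return np.concatenate([vision, attended], axis=-1)       # (n_v, 2d)

rng = np.random.default_rng(2)
v = rng.standard_normal((16, 64))   # vision tokens
t = rng.standard_normal((10, 64))   # text tokens
out = compound_tokens(v, t)
print(out.shape)  # (16, 128)
```

Concatenating at the channel level, rather than adding, keeps the original query intact alongside its cross-modal context, so no unimodal information is overwritten by the fusion step.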
Mechanism | Attention Structure | Notable Instantiation | Key Results
Channel fusion | Cross-attn, channel concat | Compound Tokens (Aladago et al., 2022) | +4.2% VQA2.0
Per-pixel attn | 2-token attention | GeminiFusion (Jia et al., 2024) | +2.6–3.4% mIoU
Stacked self-attn | Self/cross in cascade | CLMLF (Li et al., 2022), SFusion | +1–2% accuracy

5. Dynamic, Sparse, and Adaptive Fusion Strategies

To address heterogeneity, scalability, and domain robustness, several dynamic and expert-based mechanisms have emerged:

  • Sparse Fusion Transformers (SFT) (Ding et al., 2021): Aggressively pools unimodal tokens via block-pooling before multimodal encoding; achieves up to a 6× FLOPs/memory reduction while matching or outperforming naive concatenation and late fusion.
  • Mixture-of-States (MoS) (Liu et al., 15 Nov 2025): For diffusion-based multimodal generation, a token-wise, timestep-adaptive router selects which layer-wise context features (from a frozen LLM backbone) to present to each denoising block. Sparse top-k selection (with ε-greedy exploration) yields compute-efficient fusion and high parameter utilization per generation step.
  • Gating and MoE Routing (FLUID, SUMMER) (Cuong et al., 10 Aug 2025, Li et al., 31 Mar 2025): Cross-modal fusion tokens are adaptively weighted (gated) per sample or token position, with the downstream MoE (Mixture-of-Experts) providing expert specialization based on fused token cues. These methods are robust to label noise, imbalance, and semantic heterogeneity, with FLUID achieving +13% accuracy over prior baselines on GLAMI-1M.
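The token-wise sparse routing shared by these methods can be sketched for a single token: a linear router scores each candidate feature source, keeps only the top-k, and mixes them with renormalized softmax weights (a simplified illustration; ε-greedy exploration is omitted and `w_router` is a hypothetical placeholder for learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_route(token, candidates, w_router, k=2):
    """Score each candidate feature source for this token, keep the
    top-k, and return their softmax-weighted mixture plus the chosen
    indices. Sparsity means only k sources contribute per token."""
    logits = w_router @ token             # one score per candidate source
    keep = np.argsort(logits)[-k:]        # indices of the top-k sources
    weights = softmax(logits[keep])       # renormalize over survivors
    return weights @ candidates[keep], keep

rng = np.random.default_rng(3)
token = rng.standard_normal(32)
candidates = rng.standard_normal((6, 32))   # e.g. 6 layer-wise context features
w_router = rng.standard_normal((6, 32))
fused, chosen = topk_route(token, candidates, w_router, k=2)
print(fused.shape, len(chosen))
```

Because only k of the candidates are read per token, compute stays bounded even as the pool of experts or layer-wise contexts grows.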
Dynamic/Expert Fusion | Mechanism | Efficiency/Robustness Gain
SFT | Block-pool, sparse fusion | 1/6 FLOPs, matched accuracy
MoS | Token-wise router, sparse | SOTA image/editing at 1/4 size
MoE-based (FLUID) | Gating, expert assignment | 91% accuracy, robust to noise

6. Limitations and Prospects

Despite impressive gains, existing multimodal token fusion schemes face several open challenges:

  • Information loss with aggressive fusion: Overlarge spatial kernels (k ≥ 4 in STF) or excessive downsampling in SFT degrade performance due to under-represented fine structure (Tang et al., 8 Jun 2025, Ding et al., 2021).
  • Fixed, grid-based patching: In regular-grid STF/MBTF and ToFu, fixed spatial regions may not capture small off-grid objects or non-rigid regions; future work may explore dynamic, adaptive, or attention-guided fusion of tokens (Tang et al., 8 Jun 2025, Pippi et al., 6 Mar 2025).
  • Limited learning in frozen encoders: Methods relying on post-hoc fusion of frozen encoders (MBTF, ToFu, SwimVG) cannot introduce new semantically meaningful features absent from the upstream models. Full or partial encoder adaptation remains a trade-off between flexibility and efficiency.
  • Extension to video and higher-dimensional fusion: Current fusion strategies are generally evaluated on 2D data or short sequences. Extending fusion mechanisms to videos, 3D point clouds, or graph-structured data remains under-explored.

Prospective extensions include: learnable or data-adaptive fusion granularity; cross-attention-guided patch allocation; deeper integration of task signals into token fusion; and further exploration of sparse/efficient transformer architectures.

7. Comparative Table: Select Representative Models

Model/Method | Fusion Style | Main Contribution | Public Benchmarks (Gain)
STF+MBTF (Tang et al., 8 Jun 2025) | Spatial/compositional | Learnable spatial + multi-block, ViT compatible | +0.8% VQA, 1/4 FLOPs
CLMLF (Li et al., 2022) | Stacked transformers | Multi-layer, contrastive loss | +0.016 F1 (MVSA-Single)
Compound Tokens (Aladago et al., 2022) | Channel concat | Cross-attn channel fusion | +4.2% VQA2.0, +2.2% GQA
SFT (Ding et al., 2021) | Sparse block pooling | Pre-fusion token sparsification | ~6× lower compute
FLUID (Cuong et al., 10 Aug 2025) | Query distill, MoE | Learnable token distill, adaptive MoE | 91% GLAMI-1M (+13%)
GeminiFusion (Jia et al., 2024) | Pixelwise 2-token attn | Efficient linear-complexity per-pixel fusion | +2.6–3.4% mIoU, <2% latency

These diverse approaches indicate that high-quality multimodal fusion can be achieved with mechanisms that exploit both redundancy reduction and context- or task-aware feature mixing, often with minimal added computation.


References:

(Tang et al., 8 Jun 2025, Pippi et al., 6 Mar 2025, Cuong et al., 10 Aug 2025, Aladago et al., 2022, Li et al., 2022, Jia et al., 2024, Ding et al., 2021, Liu et al., 15 Nov 2025, Li et al., 31 Mar 2025, Li, 2023).
