Token-Based Modality Fusion

Updated 15 March 2026

Token-based modality fusion is a technique that represents each data modality as discrete tokens to enable fine-grained, adaptive interaction and prevent modality collapse.
It leverages methods like adaptive token merging, cross-modal attention, and gating mechanisms to maintain feature diversity and enhance robustness.
This approach improves efficiency and performance in applications such as vision-language processing, 3D detection, and multimodal semantic segmentation.

Token-based modality fusion refers to a class of multimodal fusion techniques that operate on discrete feature tokens. These techniques combine information from heterogeneous modalities, such as images, text, audio, or sensor data, within neural architectures that process inputs as token sequences (e.g., Transformers, ViTs). Unlike early fusion (concatenating features at the input) or late fusion (ensembling outputs), token-based approaches typically perform fine-grained cross-modal operations (exchange, attention, gating, merging) on modality-specific token embeddings. The resulting fusion is controllable, adaptive, and, in several methods, theoretically grounded, which lets it account for the modalities’ differing semantic roles, information densities, and statistical properties.

1. Principles of Token-Based Modality Fusion

Token-based modality fusion builds on the representation of each modality as a set or sequence of high-dimensional tokens, typically extracted from upstream networks (CNN/Transformer for images, BERT for text, etc.), each encoding local or semantic information. The fusion aims to:

Maintain discriminative power across all feature directions, counteracting “feature collapse,” where the joint representation’s eigenspectrum is dominated by a few principal axes and the others become uninformative (Kim et al., 9 Nov 2025).
Prevent “modality collapse”, in which one modality’s features dominate, suppressing complementary information from another (Kim et al., 9 Nov 2025).
Enable fine-grained, context-dependent, and learnable interaction mechanisms across token representations from each modality.
Facilitate efficient and scalable processing by exploiting token-sparsity, pre-fusion merging, or expert specialization.

This paradigm includes a spectrum of methods, such as: adaptive token merging (Zhai et al., 2023), selective channel exchange (Kim et al., 9 Nov 2025, Wang et al., 2022), cross-modal attention (Zhou et al., 2023, Kim et al., 9 Nov 2025), dynamic gating (Cuong et al., 10 Aug 2025), mixture-of-experts specialization (Wang et al., 10 Mar 2026, Lin et al., 2024), and masking or contrastive alignment (Duan et al., 2024, Georgiou et al., 15 Apr 2025, Zhou et al., 2023).

2. Architectures and Formal Models

2.1 Token Scoring, Replacement, and Exchange

Several methods identify and act upon token “informativeness”:

Rank-Enhancing Token Fuser (RTF) detects under-utilized feature directions (channels) in one modality via singular value decomposition and Shannon-entropy effective rank. Channels with low projection onto leading singular vectors are deemed under-informative and selectively replaced, or adaptively blended, with complementary directions from the other modality, using per-channel learnable coefficients (Kim et al., 9 Nov 2025).
TokenFusion (for vision transformers) attaches per-token scoring heads to estimate informativeness. Tokens with scores below threshold $\theta$ are pruned and replaced by projections of aligned tokens from other modalities, with a residual positional alignment mechanism preserving spatial structure (Wang et al., 2022).
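
The score-and-replace idea can be illustrated with a minimal NumPy sketch. It is loosely modelled on the TokenFusion description above and is not the authors' implementation: the linear scoring head `w_score`, the projection `proj_b_to_a`, and the function name `exchange_tokens` are hypothetical, and the residual positional alignment step is omitted. Tokens are assumed to be spatially aligned, so token i of one modality corresponds to token i of the other.

```python
import numpy as np

def exchange_tokens(x_a, x_b, w_score, proj_b_to_a, theta=0.5):
    """Replace low-scoring tokens of modality A with projected tokens of modality B.

    x_a, x_b:     (N, D) spatially aligned token matrices.
    w_score:      (D,) weights of a hypothetical linear scoring head.
    proj_b_to_a:  (D, D) hypothetical projection from B's space into A's.
    """
    # Per-token informativeness score in (0, 1) via a sigmoid.
    scores = 1.0 / (1.0 + np.exp(-(x_a @ w_score)))
    # Tokens scoring below the threshold are treated as uninformative.
    replace = scores < theta
    # Token i of B substitutes token i of A, so spatial position is preserved.
    fused = np.where(replace[:, None], x_b @ proj_b_to_a, x_a)
    return fused, replace

rng = np.random.default_rng(0)
x_a = rng.normal(size=(6, 4))
x_b = rng.normal(size=(6, 4))
fused, replaced = exchange_tokens(x_a, x_b, rng.normal(size=4), np.eye(4))
print(int(replaced.sum()), "of", len(replaced), "tokens replaced")
```

In a trained model the scoring head and projection would be learned jointly with the backbone; here they are random or identity placeholders so that the data flow is easy to follow.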

2.2 Adaptive Modality-Guided Merging and Attention

Fast-StrucTexT dynamically merges groups of tokens within one modality under the guidance of another, using learned per-token weights derived from cross-modal linear projections; merging and unmerging proceed in a hierarchical hourglass Transformer (Zhai et al., 2023).
Symmetry Cross Attention (SCA) alternates dual-phase cross-modal attention with standard self-attention, providing repeated, deep mixing of visual and textual streams at multiple temporal and structural granularities (Zhai et al., 2023).
GeminiFusion for vision tasks computes lightweight, pixel-wise cross-modal attention between spatially aligned tokens from different modalities, augmented by adaptive per-layer noise which ensures nontrivial cross-attention without overriding intra-modal information. This process avoids quadratic scaling in the number of tokens and removes the need for masking or full exchange (Jia et al., 2024).
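
To make the cost argument concrete, the following sketch shows one way pixel-wise cross-modal attention can be written so that each token attends only to itself and to its aligned counterpart, giving work linear in the number of tokens. It is an illustration under stated assumptions, not GeminiFusion itself: learned query/key/value projections are omitted, a fixed `noise_scale` stands in for the adaptive per-layer noise, and adding the noise to the self key is an assumption made here for illustration.

```python
import numpy as np

def pixelwise_cross_attention(x_a, x_b, noise_scale=0.1, seed=0):
    """Fuse aligned tokens; each position attends over two keys only.

    x_a, x_b: (N, D) spatially aligned tokens. Returns fused (N, D) tokens
    for modality A and the (N, 2) attention weights [self, cross].
    """
    n, d = x_a.shape
    rng = np.random.default_rng(seed)
    # Perturb the self key so the query does not trivially match itself.
    self_key = x_a + noise_scale * rng.normal(size=(n, d))
    keys = np.stack([self_key, x_b], axis=1)    # (N, 2, D)
    values = np.stack([x_a, x_b], axis=1)       # (N, 2, D)
    # Scaled dot product of each query with its own two keys: O(N * D).
    logits = np.einsum("nd,nkd->nk", x_a, keys) / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    fused = np.einsum("nk,nkd->nd", weights, values)
    return fused, weights

rng = np.random.default_rng(1)
x_a = rng.normal(size=(5, 8))
x_b = rng.normal(size=(5, 8))
fused, weights = pixelwise_cross_attention(x_a, x_b)
print(weights.round(3))  # one [self, cross] pair per spatial position
```

Because the attention matrix has shape (N, 2) rather than (N, N), memory and compute grow linearly with the number of tokens.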

2.3 Mixture-of-Experts and Routing

Modality-aware Mixture-of-Experts (MoE) architectures, e.g., in MoMa and MDTrack, assign separate pools of expert FFN blocks to each modality. Tokens are routed to their modality-specific experts based on learned gates or routers, and intra-group adaptivity is achieved via per-expert gating weights (Wang et al., 10 Mar 2026, Lin et al., 2024).
Hierarchical routing enables further compute savings: MoMa’s per-modality expert-choice routing delivers large FLOPs reductions while preserving early fusion and cross-modal attention. Expert-choice (top-$k$ per expert) ensures balanced expert utilization with no auxiliary loss (Lin et al., 2024).
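
A minimal sketch of modality-aware expert-choice routing follows. It assumes a hypothetical setup in which each modality owns its own router matrix and each expert selects its top-`capacity` tokens from that modality; the names `expert_choice_route`, `router_w`, and `capacity` are illustrative and do not come from MoMa or MDTrack.

```python
import numpy as np

def expert_choice_route(tokens, modality, router_w, capacity):
    """Assign tokens to modality-specific experts by expert-choice routing.

    tokens:   (N, D) token embeddings.
    modality: (N,) integer modality id per token.
    router_w: dict mapping modality id -> (D, E_m) router weights.
    Returns a dict mapping (modality, expert) -> indices of chosen tokens.
    """
    assignment = {}
    for m, w in router_w.items():
        idx = np.flatnonzero(modality == m)    # tokens of this modality only
        scores = tokens[idx] @ w               # (N_m, E_m) token-expert affinity
        k = min(capacity, len(idx))
        for e in range(w.shape[1]):
            # Each expert picks its own top-k tokens, so load is balanced
            # by construction and no auxiliary balancing loss is needed.
            top = np.argsort(-scores[:, e])[:k]
            assignment[(m, e)] = idx[top]
    return assignment

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 8))
modality = np.array([0] * 8 + [1] * 4)         # e.g. 0 = text, 1 = image
router_w = {0: rng.normal(size=(8, 2)), 1: rng.normal(size=(8, 2))}
assignment = expert_choice_route(tokens, modality, router_w, capacity=3)
for key, chosen in assignment.items():
    print(key, chosen)
```

Under expert-choice routing a token may be selected by several experts or by none; real systems typically handle unselected tokens through the residual connection.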

2.4 Token-Level Contrastive Alignment and Gated Fusion

FLUID employs learnable query tokens to distill salient features from unimodal token sets via cross-attention, aligns the resulting token embeddings via contrastive (InfoNCE) loss, and adaptively fuses token slots using a learned gating network before passing into a Q-bottleneck and an MoE prediction head (Cuong et al., 10 Aug 2025).
Adaptive fusion through gating—whether at channel (Kim et al., 9 Nov 2025), token (Wang et al., 15 Apr 2025, Cuong et al., 10 Aug 2025), or expert levels—enables the model to emphasize or ignore modality-specific information dynamically, addressing imbalances or noise.
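
The two ingredients named above, contrastive alignment and gated fusion, can be sketched as follows. This is a generic illustration, not FLUID's implementation: the symmetric form of the InfoNCE loss, the temperature, and the gate parameters `w_gate` and `b_gate` are assumptions, and the query-token distillation step is omitted.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE over paired rows of z_a and z_b, each of shape (B, D)."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature         # (B, B); diagonal holds positives

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

def gated_fusion(z_a, z_b, w_gate, b_gate):
    """Blend two modalities per dimension: g * z_a + (1 - g) * z_b."""
    pre = np.concatenate([z_a, z_b], axis=1) @ w_gate + b_gate
    g = 1.0 / (1.0 + np.exp(-pre))             # gate values in [0, 1]
    return g * z_a + (1.0 - g) * z_b, g

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = info_nce(z, z)                  # every pair matches
loss_shuffled = info_nce(z, z[::-1])           # every pair is mismatched
print(loss_aligned < loss_shuffled)

z_a, z_b = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
fused, g = gated_fusion(z_a, z_b, 0.1 * rng.normal(size=(32, 16)), np.zeros(16))
```

A gate value near 1 keeps the first modality's feature and a value near 0 keeps the second's, which is how such a gate can down-weight a noisy or missing modality for a given sample.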

3. Effective Rank, Collapse, and Theoretical Guarantees

A key mathematical foundation for several token-based fusion methods is the effective rank of a feature matrix $Z$, defined as

$\mathrm{ERank}(Z) = \exp\left(-\sum_{j=1}^r p_j \log p_j\right),\quad p_j = \frac{\sigma_j(Z)}{\|Z\|_*}$

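with $\sigma_j(Z)$ the singular values of $Z$, $r$ their number (zero singular values contribute nothing to the sum), and $\|Z\|_* = \sum_{j=1}^r \sigma_j(Z)$ the nuclear norm (Kim et al., 9 Nov 2025).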

Replacement of under-informative channels/tokens with features from a complementary modality, under bounded energy and alignment constraints, provably increases ERank, thereby mitigating feature and modality collapse.
Empirical results consistently show that increased effective rank in token-based fused representations correlates strongly with improved task performance (especially when visual context is limited or one modality is degraded) and robustness to noise (Kim et al., 9 Nov 2025, Wang et al., 2022).
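
The effective rank defined in this section can be computed directly from the singular values. The sketch below does so with NumPy and then runs a toy experiment in which collapsed channels are overwritten with independent ones. The toy only illustrates the direction of the effect; it is not the RTF procedure and does not check the energy and alignment conditions of the formal result. The function name `effective_rank` and the cutoff `eps` are choices made here.

```python
import numpy as np

def effective_rank(z, eps=1e-12):
    """exp of the Shannon entropy of the normalized singular values of z."""
    s = np.linalg.svd(z, compute_uv=False)
    s = s[s > eps]              # drop numerically zero singular values
    p = s / s.sum()             # p_j = sigma_j / nuclear norm
    return float(np.exp(-np.sum(p * np.log(p))))

# A fully collapsed feature matrix: every channel is the same direction.
collapsed = np.outer(np.ones(6), np.ones(4))

# Toy "replacement": overwrite two channels with independent features,
# standing in for complementary channels from another modality.
rng = np.random.default_rng(0)
mixed = collapsed.copy()
mixed[:, 2:] = rng.normal(size=(6, 2))

print("collapsed:", effective_rank(collapsed))   # close to 1
print("identity: ", effective_rank(np.eye(4)))   # close to 4
print("mixed:    ", effective_rank(mixed))       # larger than collapsed
```

The measure ranges from 1 (all energy in one direction) up to the number of singular values (energy spread evenly), which is why it serves as a continuous proxy for collapse.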

4. Applications, Empirical Performance, and Efficiency

Token-based modality fusion is validated across diverse domains and benchmarks:

Vision and 3D tasks: Multimodal semantic segmentation (NYUDv2, SUN RGB-D), multimodal image-to-image translation (Taskonomy), and 3D object detection (KITTI, ScanNetV2) consistently benefit from explicit token-level fusion over concatenation or simple pooling (Jia et al., 2024, Wang et al., 2022, Cuong et al., 10 Aug 2025).
Document understanding: Fast-StrucTexT achieves state-of-the-art F1 scores on FUNSD, CORD, and SROIE, with up to 1.9× inference speedup (Zhai et al., 2023).
Composed image retrieval: Token-based, adaptive fusion in TMCIR leads to improved alignment of search intent in complex retrieval tasks (Fashion-IQ, CIRR) (Wang et al., 15 Apr 2025).
Vision-language models and communications: Models such as TaiChi apply bilateral token-attention and nonlinear KAN-based cross-modal projections to bridge granularity and semantic differences in image/text representations (Jiang et al., 28 Feb 2026).
VideoQA and intent recognition: Transformer-based QA models utilize token-level self-attention for joint video–text reasoning, with probes (QUAG) revealing that only careful design and explicit cross-modal attention enforce genuine fusion (Rawal et al., 2023, Zhou et al., 2023).
Scalability and sparsity: Sparse Fusion Transformers and MoMa demonstrate that token sparsification (by pooling, scoring, or routing) enables 3–6× reductions in FLOPs and memory without compromising accuracy (Ding et al., 2021, Lin et al., 2024).

5. Ablation Studies, Robustness, and Design Insights

Systematic ablation and evaluation across architectures reveal the following insights:

Selective fusion outperforms uniform fusion or simple concatenation, especially in adverse or imbalanced settings (noise, missing modalities) (Kim et al., 9 Nov 2025, Wang et al., 2022, Duan et al., 2024).
Auxiliary architectural elements—such as residual positional alignment, per-token modal attention, and deep gated fusion tokens—provide additional regularization, stability, and incremental accuracy uplift (Cuong et al., 10 Aug 2025, Koutlis et al., 2023, Duan et al., 2024).
Cross-modality masked auto-encoding robustifies token-level representations against partial modality failure, directly improving real-world driving control under sensor damage (Duan et al., 2024).
Granularity and fusion depth matter: deeper token fusion (inserting fusion tokens across multiple decoder layers) consistently improves accuracy in, e.g., sentiment analysis, with diminishing returns at large depth or token count (Georgiou et al., 15 Apr 2025).
Mixture-of-experts with modality-aware routing delivers the best efficiency–performance tradeoff; e.g., MoMa achieves 3.7× overall FLOPs savings, outperforming standard MoEs and maintaining low inference loss even with imperfect routers (Lin et al., 2024).

6. Open Issues and Future Directions

Cross-modal inductive biases: Several analyses (e.g., QUAG, CLAVI) demonstrate that standard Transformer self-attention fusion does not guarantee strong cross-modal coupling, and models often exploit dataset biases. Explicit architectural or loss regularization is necessary to force models to utilize both modalities robustly (Rawal et al., 2023).
Fusion curriculum and representation order: Ablations demonstrate that order and initialization of cross-modal encoders, fusion tokens, and loss functions influence performance, with fine-tuned encoders and fusion-aware losses (contrastive, auxiliary per-modality) commonly yielding the strongest results (Georgiou et al., 15 Apr 2025, Cuong et al., 10 Aug 2025).
Scalability in large models: Modality-aware expert routing and efficient hierarchical token mechanisms (MoMa, Fast-StrucTexT, GeminiFusion) are key to scaling multimodal fusion to billion-parameter models without quadratic complexity or memory bottlenecks (Lin et al., 2024, Zhai et al., 2023, Jia et al., 2024).
Task-adaptive token selection/generation: Future directions may integrate task-aware and semantic-driven selection of fusion tokens, more principled information-theoretic measures, and dynamic structures that adapt token fusion patterns per sample or context.

7. Representative Methods and Results

Method	Core Mechanism	Efficiency/Accuracy Impact
RTF (Kim et al., 9 Nov 2025)	Rank-guided channel blending	+3.74% MoC (NTURGBD), ↑robustness
TokenFusion (Wang et al., 2022)	Token scoring and exchange	+1.7–2.1 mIoU, +1.0 mAP, +30% FID
Fast-StrucTexT (Zhai et al., 2023)	Guided hierarchical merging	+2–4% F1, 1.9× speedup
GeminiFusion (Jia et al., 2024)	Pixel-wise cross-attention	+2.6% mIoU, 99.2% fewer FLOPs
MoMa (Lin et al., 2024)	Modality-aware MoE	3.7× FLOPs saving, best loss @1T tokens
MDTrack (Wang et al., 10 Mar 2026)	MoE + decoupled temporal SSMs	+2.6%–2.1% AUC/F1 across tracking
MemeFier (Koutlis et al., 2023)	Dual-stage (alignment + Transf.)	+21–44% AUC over baseline
MaskFuser (Duan et al., 2024)	Joint tokenization + masked AE	+4.5% DS, +3.2% RC, ↑robustness
FLUID (Cuong et al., 10 Aug 2025)	Token QS + contrastive gating	91% acc. on GLAMI-1M, robust

Token-based modality fusion thus constitutes a foundational approach for scalable, robust deep multimodal modeling, with principled information-theoretic and architectural innovations systematically improving information preservation, efficiency, and cross-modal balance across a range of vision, language, temporal, and decision-making tasks (Kim et al., 9 Nov 2025, Wang et al., 2022, Zhai et al., 2023, Lin et al., 2024, Cuong et al., 10 Aug 2025, Jia et al., 2024, Zhou et al., 2023, Rawal et al., 2023, Duan et al., 2024).