
Token-Based Modality Fusion

Updated 30 January 2026
  • Token-based modality fusion is a method that treats modality outputs as token sequences to capture fine-grained, context-dependent inter-modal correlations.
  • It employs techniques like alignment-aware merging, masked fusion, and adaptive gating to selectively integrate information from diverse data sources.
  • Empirical studies show that token-level fusion improves performance in tasks such as image retrieval, sentiment analysis, and autonomous driving compared to global fusion methods.

Token-based modality fusion denotes a class of methods that operate over discrete or continuous token embeddings to integrate information from multiple modalities—such as vision, language, audio, or structured attributes—by learning fine-grained, often position- and context-dependent, dependencies at the token level within deep networks. This approach contrasts with global vector fusion or scalar gating by leveraging the native structure of token sequences, allowing selective, localized, and often adaptive mixing of modalities via operations such as transformer attention, gating, alignment-aware merging, and expert routing. Token-based fusion constitutes the backbone of modern multimodal architectures that pursue both robustness to modality-specific noise and optimal exploitation of cross-modal correlations.

1. Foundational Principles of Token-Based Fusion

The core principle behind token-based modality fusion is to treat the representations from each modality as sets or sequences of tokens—where each token can represent a spatial patch, sequence step, contextual word embedding, or other atomic input feature (Koutlis et al., 2023, Wang et al., 2022, Duan et al., 2024). The goal is to unify these heterogeneous token sets in a way that exposes and preserves rich correlations or complementarities at a granular level.

Typically, token representations are generated via modality-specific encoders (e.g., CLIP ViT for images, BERT/M-BERT for text, discrete VAEs for image quantization, or CNNs for audio/features), yielding sets of vectors \{o_i^g\}, \{o_j^x\}, etc. Fusion then entails either concatenating, aligning, or adaptively modifying these token sequences via additional transformer blocks or dedicated fusion modules, thus enabling local semantic integration.
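
As a concrete illustration of this pipeline, the sketch below (PyTorch; the encoder output shapes, projection widths, and encoder depth are illustrative assumptions, not any specific cited architecture) projects per-modality token sequences into a shared width, concatenates them along the sequence axis, and mixes them with a small transformer encoder:

```python
import torch
import torch.nn as nn

class ConcatTokenFusion(nn.Module):
    """Generic token-level fusion: project each modality's tokens to a shared
    width, concatenate along the sequence axis, and let a transformer encoder
    mix them via self-attention."""
    def __init__(self, dims=(768, 512), d_model=256, n_layers=2, n_heads=8):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        # Learned modality-type embeddings so the encoder can tell streams apart.
        self.type_emb = nn.Parameter(torch.zeros(len(dims), d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_sets):
        # token_sets: list of (B, N_m, D_m) tensors, one per modality.
        streams = [p(t) + self.type_emb[m]
                   for m, (p, t) in enumerate(zip(self.proj, token_sets))]
        fused = torch.cat(streams, dim=1)   # (B, sum of N_m, d_model)
        return self.encoder(fused)          # token-level cross-modal mixing

# Example: 196 image-patch tokens and 32 text tokens for a batch of 4.
img_tokens = torch.randn(4, 196, 768)   # e.g., ViT patch embeddings
txt_tokens = torch.randn(4, 32, 512)    # e.g., BERT contextual embeddings
fused = ConcatTokenFusion()([img_tokens, txt_tokens])
print(fused.shape)  # torch.Size([4, 228, 256])
```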

These fusion schemes are motivated by the hypothesis that multi-modal tasks (e.g., meme classification, situated retrieval, autonomous driving, or multimodal sentiment analysis) require expressive models that can reason over spatial, sequential, and semantic alignment at the sub-feature or token level rather than at the coarse global embedding level.

2. Fusion Architectures and Mechanisms

Alignment-aware and Transformer-based Fusion

Architectures such as MemeFier implement dual-stage fusion: first, an alignment-aware fusion stage computes element-wise products between each image patch and global text vector (and vice versa), producing "text-aware" image tokens and "image-aware" text tokens (Koutlis et al., 2023). In the second stage, all fused tokens—including external metadata and a [CLS] token—are processed by a standard multi-head Transformer encoder, which learns higher-order inter-token and cross-modality dependencies via attention:

\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\bigl(\tfrac{Q K^\top}{\sqrt{d_k}}\bigr)V
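
A hedged sketch of this dual-stage pattern follows (PyTorch; tensor shapes, module sizes, and the classification head are assumptions rather than the published MemeFier configuration): image tokens are modulated by the global text vector and text tokens by the global image vector via element-wise products, and the fused sequence, prepended with a learnable [CLS] token, is processed by a standard transformer encoder.

```python
import torch
import torch.nn as nn

class DualStageFusion(nn.Module):
    """Alignment-aware fusion followed by transformer mixing (MemeFier-style sketch)."""
    def __init__(self, d=256, n_layers=2, n_heads=8):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d))
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, img_tok, txt_tok, img_global, txt_global):
        # img_tok: (B, N_img, d), txt_tok: (B, N_txt, d); globals: (B, d).
        # Stage 1: element-wise alignment, f_i^g = o_i^g * o^x (and symmetrically for text).
        text_aware_img = img_tok * txt_global.unsqueeze(1)
        image_aware_txt = txt_tok * img_global.unsqueeze(1)
        # Stage 2: [CLS] plus fused tokens through a multi-head self-attention encoder.
        cls = self.cls.expand(img_tok.size(0), -1, -1)
        tokens = torch.cat([cls, text_aware_img, image_aware_txt], dim=1)
        out = self.encoder(tokens)
        return out[:, 0]  # [CLS] embedding for downstream classification
```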

Unified Tokenization and Masked Fusion

MaskFuser translates dense feature maps from both image and LiDAR branches into a single vocabulary of semantic tokens via patch-wise quantization and aggregation. Early fusion is achieved via geometric attention (e.g., monotonic-to-BEV translation), followed by late fusion in a shared transformer encoder operating on unified token streams (Duan et al., 2024). A masked autoencoder objective, with random token masking and reconstruction, further enforces that the token embedding space is equally informative for either modality, yielding improved robustness under sensor damage.
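
The masking objective can be sketched as follows (PyTorch; the mask ratio, the shared encoder/decoder interfaces, and the loss weights are illustrative assumptions rather than the published MaskFuser recipe): random subsets of the unified image and LiDAR tokens are dropped, and the model must reconstruct them, pushing both modalities toward a mutually informative token space.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(img_tok, lidar_tok, encoder, decoder,
                               mask_ratio=0.5, w_im=1.0, w_li=1.0):
    """MAE-style sketch over a unified token stream.
    img_tok, lidar_tok: (B, N, D) token embeddings from each branch.
    encoder/decoder: shared transformer encoder and lightweight reconstruction head."""
    tokens = torch.cat([img_tok, lidar_tok], dim=1)            # unified token stream
    B, N, D = tokens.shape
    keep = torch.rand(B, N, device=tokens.device) > mask_ratio
    visible = tokens * keep.unsqueeze(-1)                      # zero out masked tokens
    recon = decoder(encoder(visible))                          # (B, N, D) reconstruction
    n_img = img_tok.size(1)
    loss_im = F.mse_loss(recon[:, :n_img], img_tok)
    loss_li = F.mse_loss(recon[:, n_img:], lidar_tok)
    return w_im * loss_im + w_li * loss_li                     # weighted reconstruction loss
```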

Dynamic Token Identification and Substitution

TokenFusion dynamically detects uninformative tokens at each transformer layer via a learned scoring MLP, sparsifies token streams by thresholding, and selectively substitutes low-score tokens with aligned projected features from complementary modalities. Residual positional alignment ensures correct spatial correspondence after substitution (Wang et al., 2022). This plug-in approach minimally disrupts the model’s original transformer architecture while enabling adaptive, fine-grained fusion.
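
A minimal sketch of this scoring-and-substitution step for two modalities is shown below (PyTorch; the scoring head, threshold, and projection are illustrative placeholders for the learned components described in the paper):

```python
import torch
import torch.nn as nn

class TokenSubstitution(nn.Module):
    """Score tokens of modality A; replace low-scoring ones with projected,
    positionally aligned tokens from modality B (TokenFusion-style sketch)."""
    def __init__(self, d=256, threshold=0.02):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))
        self.proj = nn.Linear(d, d)   # aligns modality B into A's embedding space
        self.threshold = threshold

    def forward(self, tok_a, tok_b):
        # tok_a, tok_b: (B, N, d) token streams sharing the same positional ordering.
        s = torch.sigmoid(self.score(tok_a))          # (B, N, 1) informativeness scores
        keep = (s >= self.threshold).float()
        # Substitute uninformative tokens of A with aligned projections of B.
        return keep * tok_a + (1.0 - keep) * self.proj(tok_b)
```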

Learnable Fusion and Cross-modal Gating

Fusion depth and dedicated multimodal capacity are studied in DeepMLF, where learnable fusion tokens are appended to the LM backbone and progressively updated via causal self-attention and gated cross-attention with audiovisual encoders. Gating mechanisms regulate the intensity of fusion at each layer, preventing overdominance of either the text or the audiovisual stream; fusion depth is found to be optimal at 5–7 layers (Georgiou et al., 15 Apr 2025).
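
The gating pattern can be illustrated with the following sketch (PyTorch; the module sizes, number of fusion tokens, and gate parameterization are assumptions rather than the DeepMLF configuration): a bank of fusion tokens cross-attends to audiovisual features, and a learned gate controls how strongly the attended content updates the tokens at each layer.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionLayer(nn.Module):
    """Fusion tokens attend to audiovisual features; a per-channel gate
    regulates how much cross-modal content enters at this layer."""
    def __init__(self, d=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(d))   # starts closed: tanh(0) = 0

    def forward(self, fusion_tokens, av_features):
        # fusion_tokens: (B, K, d) tokens carried through the LM backbone
        #                (learnable parameters in practice).
        # av_features:   (B, N, d) audio-visual encoder outputs.
        attended, _ = self.cross_attn(fusion_tokens, av_features, av_features)
        return fusion_tokens + torch.tanh(self.gate) * attended  # gated residual update

# Example with ~20 fusion tokens, the reported sweet spot.
layer = GatedCrossAttentionLayer()
updated = layer(torch.randn(4, 20, 256), torch.randn(4, 50, 256))  # (4, 20, 256)
```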

Cross-modal Similarity and Adaptive Merging

Mechanisms such as Adaptive Token Fusion (TMCIR) leverage pairwise similarity between visual and text tokens to identify matched pairs, which are merged with weights directly derived from cosine similarity; unmatched tokens are retained, and the fused token set is pooled and projected for downstream contrastive learning (Wang et al., 15 Apr 2025). This dynamic balancing attempts to optimize user intent capture in composed image retrieval.
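
A hedged sketch of similarity-driven merging follows (PyTorch; the one-to-one matching rule and the similarity threshold are simplifying assumptions): visual and text tokens whose cosine similarity exceeds a threshold are merged with similarity-derived weights, while unmatched tokens pass through unchanged.

```python
import torch
import torch.nn.functional as F

def adaptive_token_merge(vis_tok, txt_tok, sim_threshold=0.5):
    """Merge best-matching visual/text token pairs with cosine-similarity weights
    (simplified sketch of adaptive token fusion).
    vis_tok: (N_v, d), txt_tok: (N_t, d) -- single example for clarity."""
    sim = F.cosine_similarity(vis_tok.unsqueeze(1), txt_tok.unsqueeze(0), dim=-1)  # (N_v, N_t)
    best_sim, best_idx = sim.max(dim=1)              # best text match per visual token
    matched = best_sim > sim_threshold
    w = best_sim.clamp(min=0).unsqueeze(-1)          # similarity-derived merge weight
    merged = w * vis_tok + (1 - w) * txt_tok[best_idx]
    fused_vis = torch.where(matched.unsqueeze(-1), merged, vis_tok)
    # Unmatched text tokens are retained alongside the (possibly merged) visual tokens.
    all_txt = torch.arange(txt_tok.size(0), device=txt_tok.device)
    unmatched_txt = txt_tok[~torch.isin(all_txt, best_idx[matched])]
    return torch.cat([fused_vis, unmatched_txt], dim=0)
```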

Mixture-of-Experts and Specialized Routing

FLUID implements token-level distillation into “latent queries” via trainable cross-attention queries, enforces cross-modal consistency via contrastive alignment, fuses tokens adaptively with per-token gating and Q-bottleneck compression, and applies a load-balanced Mixture-of-Experts at classification time to specialize to semantic heterogeneity (Cuong et al., 10 Aug 2025). MoMa routes image and text tokens to modality-specialist expert pools for efficiency, further splitting tokens hierarchically by expert choice and, optionally, in depth (Lin et al., 2024).
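
The routing idea behind modality-aware expert pools can be sketched as follows (PyTorch; the expert architecture and top-1 routing are simplifying assumptions, not the MoMa or FLUID implementation): each token is dispatched to the expert pool belonging to its modality, and a learned router selects an expert within that pool.

```python
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    """Route each token to an expert within its own modality's expert pool
    (modality-specialist MoE sketch with top-1 routing)."""
    def __init__(self, d=256, experts_per_modality=4, n_modalities=2):
        super().__init__()
        self.pools = nn.ModuleList([
            nn.ModuleList([nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
                           for _ in range(experts_per_modality)])
            for _ in range(n_modalities)])
        self.routers = nn.ModuleList([nn.Linear(d, experts_per_modality)
                                      for _ in range(n_modalities)])

    def forward(self, tokens, modality_ids):
        # tokens: (N, d) flattened token batch; modality_ids: (N,) integer modality labels.
        out = torch.zeros_like(tokens)
        for m, (pool, router) in enumerate(zip(self.pools, self.routers)):
            idx = (modality_ids == m).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            choice = router(tokens[idx]).argmax(dim=-1)       # top-1 expert per token
            for e, expert in enumerate(pool):
                sel = idx[choice == e]
                if sel.numel() > 0:
                    out[sel] = expert(tokens[sel])
        return out
```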

Token Merger, Sparse Fusion, and Complexity Reduction

Fast-StrucTexT and SFT focus on efficient fusion. Fast-StrucTexT uses modality-guided dynamic token merging, where cross-modal attention from the other modality predicts importance scores determining merge weights. Each block merges k contiguous tokens via weighted pooling (Zhai et al., 2023). SFT applies block-sparse attention and non-overlapping pooling, drastically reducing the token set prior to multimodal fusion, subsequently enabling dense fusion over a compressed sequence (Ding et al., 2021).
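
A simplified sketch of modality-guided merging is given below (PyTorch; the source of the importance scores and the merge factor k are assumptions): importance scores derived from the other modality weight each token, and every block of k contiguous tokens is collapsed by weighted pooling.

```python
import torch

def guided_token_merge(tokens, importance, k=2):
    """Merge each block of k contiguous tokens via importance-weighted pooling.
    tokens: (B, N, d); importance: (B, N) scores, e.g. from cross-modal attention."""
    B, N, d = tokens.shape
    assert N % k == 0, "sequence length must be divisible by the merge factor"
    tok = tokens.view(B, N // k, k, d)
    w = torch.softmax(importance.view(B, N // k, k), dim=-1).unsqueeze(-1)
    return (w * tok).sum(dim=2)        # (B, N // k, d) compressed token sequence

# Example: halve a 196-token visual stream using text-derived importance scores.
vis = torch.randn(2, 196, 256)
imp = torch.randn(2, 196)              # stand-in for cross-modal importance
merged = guided_token_merge(vis, imp, k=2)   # (2, 98, 256)
```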

3. Mathematical Formulations and Token Operations

Across methods, key mathematical operations include:

  • Element-wise token alignment (MemeFier):

f^g_i = o^g_i \odot o^x

  • Masking and reconstruction (MaskFuser):

L_{MAE} = \lambda_{im} \|x_{im} - \bar{x}_{im}\|_2^2 + \lambda_{li} \|x_{li} - \bar{x}_{li}\|_2^2

  • Token scoring and binarization (TokenFusion):

s^l(e_m^l) = \sigma(W_s e_m^l + b_s), \qquad M_m^l[i] = 1 \ \text{if}\ s^l(e_m^l)[i] \geq \theta

  • Gated fusion (FLUID):

F_{j,:} = a_j\,I_{n,j,:} + (1-a_j)\,T_{n,j,:}

  • Mixer with learned weights (MMoT):

T_{\mathrm{fused}}(i) = \sum_{m=0}^{M} \alpha_m(i)\, F_{i,m,:}

Token-wise operations thus enable learned or data-driven weighting, replacement, merging, or specialization, supporting context-dependent fusion strategies.
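
To make the mixer formulation above concrete, the following sketch (PyTorch; the linear-softmax weight predictor is an assumed parameterization) computes per-token mixing weights α_m(i) over M aligned modality streams and forms the weighted sum:

```python
import torch
import torch.nn as nn

class TokenMixer(nn.Module):
    """Per-token convex combination of aligned modality streams,
    T_fused(i) = sum_m alpha_m(i) * F[i, m, :] (mixer-style sketch)."""
    def __init__(self, d=256, n_modalities=3):
        super().__init__()
        self.weight_head = nn.Linear(n_modalities * d, n_modalities)

    def forward(self, streams):
        # streams: (B, N, M, d) -- M aligned token streams of equal length.
        B, N, M, d = streams.shape
        alpha = torch.softmax(self.weight_head(streams.reshape(B, N, M * d)), dim=-1)
        return (alpha.unsqueeze(-1) * streams).sum(dim=2)   # (B, N, d) fused tokens
```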

4. Optimization Objectives and Regularizers

Token-based fusion schemes are trained with a range of objectives, typically combining task-specific losses with auxiliary terms such as masked reconstruction, contrastive alignment, or expert load balancing drawn from the mechanisms described above.

Notably, several works emphasize robust regularization to counter modality imbalance, feature collapse, or overdominance (Kim et al., 9 Nov 2025, Cuong et al., 10 Aug 2025). For instance, effective rank maximization in RTF is theoretically ensured via selective channel blending, without explicit auxiliary losses.
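
For reference, effective rank is commonly computed as the exponential of the entropy of the normalized singular value spectrum; the short sketch below (PyTorch; this is the standard definition, not code from the cited work) evaluates it for a matrix of token embeddings, which is one way to diagnose collapse of a fused representation.

```python
import torch

def effective_rank(features, eps=1e-12):
    """Effective rank = exp(entropy of the normalized singular values).
    features: (N, d) matrix of token embeddings; values near 1 indicate collapse."""
    s = torch.linalg.svdvals(features)
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy)

# A collapsed (rank-1) representation has effective rank close to 1;
# random features score much higher, approaching the feature dimension.
collapsed = torch.ones(100, 1) @ torch.randn(1, 64)
print(effective_rank(collapsed))             # ~1.0
print(effective_rank(torch.randn(100, 64)))  # substantially larger
```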

5. Empirical Outcomes and Robustness

Experiments across domains support the advantages of token-based fusion over coarser global fusion baselines:

  • MemeFier achieves state-of-the-art meme classification via dual-stage fusion and token-level transformer processing (Koutlis et al., 2023).
  • MaskFuser yields a driving score of 49.05 and route completion of 92.85% on CARLA, outperforming prior baselines by up to 30.9% under heavy sensor masking (Duan et al., 2024).
  • TokenFusion reduces FID by 29.8% in image translation and improves mIoU in segmentation by 2.8 points over vanilla fusion (Wang et al., 2022).
  • DeepMLF demonstrates optimal fusion at 5–7 layers and with ~20 fusion tokens, while outperforming previous Multimodal Sentiment Analysis systems (Georgiou et al., 15 Apr 2025).
  • TMCIR’s adaptive token fusion raises Fashion-IQ R@10 from 29.68% to 56.57%, indicating strong intent capture (Wang et al., 15 Apr 2025).
  • Sparse Fusion Transformer achieves up to a 6-fold reduction in FLOPs and memory without sacrificing accuracy (Ding et al., 2021).
  • FLUID gains 91% accuracy on GLAMI-1M, while ablations confirm that each token operation contributes non-trivial robustness and generalization improvements (Cuong et al., 10 Aug 2025).
  • MMoT’s block-level mixer improves FID by 18 points in multimodal image synthesis, surpassing naive concatenation (Zheng et al., 2023).

Robustness to missing, noisy, or unbalanced modalities is commonly demonstrated: masked fusion enables hallucination of missing patches (Duan et al., 2024); specialized gating and mixture-of-experts support adaptation to semantic heterogeneity and long-tailed distributions (Cuong et al., 10 Aug 2025, Lin et al., 2024); balanced loss sampling counteracts modality imbalance during conditional synthesis (Zheng et al., 2023).

6. Challenges, Limitations, and Future Directions

Practical limitations have been identified through probing frameworks such as QUAG: many VideoQA transformer architectures exploit unimodal shortcut cues and fail to fuse tokens in a strongly multimodal fashion, achieving high accuracy even when cross-modal token mixing is disabled (Rawal et al., 2023). This has motivated more granular diagnostic probes, regularization strategies for inter-modal attention, and adaptive token gating.

Challenges remain in scaling fusion depth efficiently (DeepMLF), balancing computational and memory cost (SFT, Fast-StrucTexT), avoiding representation collapse (RTF), and ensuring causal consistency in sparse routing (MoMa+MoD).

Emerging directions focus on extending token-based fusion to arbitrary modality compositions (e.g., vision, text, radar), modality-specific expert allocation, adaptive fusion depth, stability under massive scale (Chameleon), contrastive curriculum design, and fine-grained prompt learning.

7. Tabular Summary of Canonical Token-Based Fusion Methods

| Method | Fusion Principle | Domain(s) |
|---|---|---|
| MemeFier (Koutlis et al., 2023) | Dual-stage, token-level alignment then transformer | Meme hate speech; image+text+attribute |
| MaskFuser (Duan et al., 2024) | Unified tokenization + masked AE | Vision (image + LiDAR); autonomous driving |
| TokenFusion (Wang et al., 2022) | Score & substitute tokens; projection | Vision, segmentation, detection |
| DeepMLF (Georgiou et al., 15 Apr 2025) | Layered learnable tokens + gated cross-attention | Multimodal sentiment (AV+text) |
| TMCIR (Wang et al., 15 Apr 2025) | Pairwise similarity-based token merging | Composed image retrieval |
| FLUID (Cuong et al., 10 Aug 2025) | Q-transform distillation + adaptive gating + MoE | Multimodal product classification |
| SFT (Ding et al., 2021) | Fixed block-sparse token pooling pre-fusion | Video, audio, multimodal classification |
| Fast-StrucTexT (Zhai et al., 2023) | Modality-guided token merging, SCA | Document understanding (text+layout) |
| RTF (Kim et al., 9 Nov 2025) | Channel-level blending by effective rank | Action anticipation (RGB+depth) |
| MoMa (Lin et al., 2024) | Modality-aware expert routing; layer-wise splitting | Large-scale early-fusion LMs |
| MMoT (Zheng et al., 2023) | Mixer with block-wise adaptive modality weights | Composed multimodal image synthesis |
| Chameleon (Team, 2024) | Quantized token interleaving + shared self-attention | Foundational mixed-modal models |

This table, derived directly from the cited works, summarizes fusion mechanisms and application areas.


Token-based modality fusion thus comprises a spectrum of architectures, algorithms, and objectives, all seeking principled, efficient, and robust cross-modal integration at the token level. This paradigm governs leading advances in multimodal classification, synthesis, autonomous control, document understanding, and foundation model design. Empirical and theoretical analyses indicate that fine-grained token-centric strategies—including attention, gating, replacement, expert assignment, and adaptive merging—are indispensable for achieving generalizable, noise-resilient, and interpretable multimodal models.
