Token-level Fusion in Multimodal Models
- Token-level fusion is a technique that selectively integrates token representations from neural encoders to enhance efficiency and interpretability.
- It employs dynamic selection, score-driven substitution, and multi-criteria merging to balance information retention with computational efficiency.
- This approach is applied in vision, language, and multimodal tasks, achieving state-of-the-art results and reduced inference overhead.
Token-level fusion refers to methods that operate on or across the discrete token representations produced by neural encoders in multimodal or unimodal learning architectures. The goal is to enhance learning by selectively integrating, merging, or aligning tokenwise features across modalities, levels, or models. These approaches have found applications in vision, language, speech, and multimodal tasks, yielding advances in efficiency, interpretability, and performance. The following sections detail representative frameworks, mathematical principles, integration strategies, task-specific results, and current research directions.
1. Dynamic and Selective Fusion Mechanisms
A central theme in token-level fusion is the dynamic selection and transformation of tokens to optimize the information flow between modalities or within large token sets. One canonical mechanism, TokenFusion for Vision Transformers (Wang et al., 2022), identifies “uninformative” tokens—those deemed unnecessary for discriminative learning—through an importance score $s^l(\mathbf{e}_i^l)$ predicted by a lightweight scoring network, with a threshold $\theta$. Rather than discarding these, such tokens are replaced at each layer $l$ by a projection of features from another modality:

$$\mathbf{e}_i^l \leftarrow \mathbf{e}_i^l \odot \mathbb{I}\!\left[s^l(\mathbf{e}_i^l) \ge \theta\right] + h^l(\bar{\mathbf{e}}_i^l) \odot \mathbb{I}\!\left[s^l(\mathbf{e}_i^l) < \theta\right],$$

where $\bar{\mathbf{e}}_i^l$ is the spatially aligned token from the other modality and $h^l$ is a learned projection. This preserves maximum spatial and semantic information while allowing inter-modal interactions precisely where single-modal representations are weak. Residual positional alignment (RPA) ensures that even after token substitution, original positional embeddings are preserved for accurate downstream fusion.
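A minimal PyTorch sketch of this substitution rule follows, assuming a two-modality setup with spatially aligned token sequences. The module and parameter names (`TokenSubstitution`, `score_mlp`, `proj`) and the threshold value are illustrative assumptions, not the authors' reference implementation; RPA is assumed to be handled by the surrounding network.

```python
import torch
import torch.nn as nn

class TokenSubstitution(nn.Module):
    """Replace low-scoring tokens of modality A with projected tokens of modality B.

    Sketch of the TokenFusion-style substitution rule:
      e_i <- e_i          if s(e_i) >= theta
      e_i <- proj(e'_i)   otherwise
    """

    def __init__(self, dim: int, theta: float = 0.02):  # theta is illustrative
        super().__init__()
        self.score_mlp = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                       nn.Linear(dim // 4, 1), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)  # inter-modal projection h(.)
        self.theta = theta

    def forward(self, tok_a: torch.Tensor, tok_b: torch.Tensor) -> torch.Tensor:
        # tok_a, tok_b: (batch, num_tokens, dim), assumed spatially aligned.
        scores = self.score_mlp(tok_a)          # (B, N, 1), importance s(e_i)
        keep = (scores >= self.theta).float()   # indicator I[s >= theta]
        return keep * tok_a + (1.0 - keep) * self.proj(tok_b)
```

Because the indicator is non-differentiable, the scores themselves must be trained indirectly (e.g., via a sparsity regularizer on $s^l$); the sketch omits this.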
Selective fusion can also be governed by multiple criteria. For example, Multi-criteria Token Fusion (MCTF) (Lee et al., 15 Mar 2024) integrates token similarity (measured by cosine similarity), informativeness (via attention weights), and size (how many tokens have already merged into one) using an attraction function of the form

$$W_{ij} = \mathrm{sim}(\mathbf{x}_i, \mathbf{x}_j)^{1/\tau_{\mathrm{sim}}} \cdot \mathrm{info}(\mathbf{x}_i, \mathbf{x}_j)^{1/\tau_{\mathrm{info}}} \cdot \mathrm{size}(\mathbf{x}_i, \mathbf{x}_j)^{1/\tau_{\mathrm{size}}},$$

where $\tau_{\mathrm{sim}}$, $\tau_{\mathrm{info}}$, and $\tau_{\mathrm{size}}$ are temperature hyperparameters. A bidirectional bipartite soft matching is then performed to pair tokens for fusion, with informativeness peeking one layer ahead to protect attentive tokens.
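The sketch below shows one way such a multiplicative attraction could be computed and used for a single greedy merge. MCTF itself performs bidirectional bipartite soft matching over many pairs at once, so treat this as a simplified illustration; the inputs `attn` (mean attention received, as an informativeness proxy) and `size` (per-token merge counts) are assumptions about how the criteria are supplied.

```python
import torch
import torch.nn.functional as F

def attraction(x, attn, size, t_sim=1.0, t_info=1.0, t_size=1.0):
    """Pairwise attraction combining similarity, informativeness, and size.

    x:    (N, D) token features
    attn: (N,)   mean attention received by each token, in [0, 1]
    size: (N,)   float count of original tokens merged into each token
    Returns an (N, N) matrix; higher values mean "merge these two first".
    """
    x_n = F.normalize(x, dim=-1)
    sim = (x_n @ x_n.T).clamp(min=1e-6) ** (1.0 / t_sim)
    # Tokens receiving little attention are safer to merge, so low
    # attention should increase attraction: use (1 - attn).
    info = ((1 - attn[:, None]) * (1 - attn[None, :])).clamp(min=1e-6) ** (1.0 / t_info)
    sz = (1.0 / (size[:, None] * size[None, :])).clamp(min=1e-6) ** (1.0 / t_size)
    return sim * info * sz

def merge_one_pair(x, size, W):
    """Greedily merge the single most attracted pair by size-weighted averaging."""
    W = W.clone()
    W.fill_diagonal_(-float("inf"))               # never merge a token with itself
    i, j = divmod(int(W.argmax()), W.shape[1])
    x = x.clone(); size = size.clone()
    x[i] = (size[i] * x[i] + size[j] * x[j]) / (size[i] + size[j])
    size[i] = size[i] + size[j]
    keep = [k for k in range(x.shape[0]) if k != j]
    return x[keep], size[keep]
```

Size-weighted averaging ensures that a token which already represents many originals is not drowned out by a freshly merged partner.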
2. Fusion Strategies and Architectural Placement
The placement and granularity of token fusion have a substantial impact on performance and representation power. Strategies include:
- Parallel/Late Fusion: In late token fusion architectures (Choi et al., 2022), branches based on CNN (local tokens) and ViT (global tokens) process features independently, then concatenate or fuse at the output stage. This preserves independent learning in each branch, with alignment via upsampling and channel transformations.
- Early Fusion: Early fusion combines token representations before any deep feature extraction, typically via bridge blocks operating on raw or shallow features. However, empirical results (Choi et al., 2022) show this approach is usually less effective, as it may reduce the ability of specialized encoders to form strong unimodal features before interacting.
- Layer-by-layer Fusion: Layerwise fusion merges dual token streams at multiple network levels, often using mixing blocks with both attention and convolutional modules. This approach enables progressive mixing of local and global features, maximizing the benefits of complementary representations. Empirically, it yields the highest ImageNet-1K scores, with top-1 classification accuracy reaching 87.77% (Choi et al., 2022).
- Cross-modal Cases: In modern multimodal transformers, token fusion can include precise spatial alignment, as in mapping point cloud tokens to image patch tokens using camera calibration matrices (Wang et al., 2022), or identity mapping for homogeneous modalities.
- Cross-layer Selectivity: Famba-V (Shen et al., 15 Sep 2024) demonstrates that selectively applying token fusion in upper, lower, or interleaved layers improves training efficiency. Fusion by averaging similar tokens (based on cosine similarity) in higher layers is most effective for balancing computational efficiency and accuracy, reducing training time and peak memory while preserving top-1 accuracy on CIFAR-100.
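As a concrete illustration of the cross-layer strategy, the sketch below applies cosine-similarity token merging only in the upper half of a block stack. The alternating bipartite split and the function names are assumptions in the spirit of Famba-V and ToMe-style merging, not code from the paper, and pair collisions are ignored for brevity.

```python
import torch
import torch.nn.functional as F

def fuse_similar_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs of one layer by averaging.

    x: (N, D) token features. Tokens are split into two alternating sets,
    each A-token is matched to its most similar B-token by cosine
    similarity, and the r strongest matches are averaged.
    """
    a, b = x[0::2], x[1::2]
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T   # (|A|, |B|)
    best_val, best_j = sim.max(dim=1)      # nearest B-token for each A-token
    merge_i = best_val.topk(min(r, a.shape[0])).indices       # strongest pairs
    merged = b.clone()
    merged[best_j[merge_i]] = 0.5 * (a[merge_i] + b[best_j[merge_i]])
    keep_a = [i for i in range(a.shape[0]) if i not in set(merge_i.tolist())]
    return torch.cat([a[keep_a], merged], dim=0)

def forward_with_upper_fusion(blocks, x, r=8):
    """Fuse only in the upper half of the stack, the placement Famba-V
    reports as the best efficiency/accuracy balance."""
    for layer_idx, block in enumerate(blocks):
        x = block(x)
        if layer_idx >= len(blocks) // 2:
            x = fuse_similar_tokens(x, r)
    return x
```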
3. Mathematical and Algorithmic Formulations
Token-level fusion methods employ rigorous mathematical criteria to decide which tokens to merge, prune, or substitute, and how to perform the fusion. Typical formulations include:
- Scoring functions: importance scores $s^l(\mathbf{e}_i^l)$ for substitution or pruning, and multi-criteria attractions $W_{ij}$ for merging (as defined above).
- Fusion operators: Weighted averaging or norm-preserving merges (e.g., MLERP, a generalized SLERP preserving feature norm (Kim et al., 2023); a minimal sketch follows this list); sum pooling over attentively matched token pairs (Meng et al., 3 Jun 2024).
- Attention-based alignment: Use of cross-attention layers to align features before channel concatenation for compound tokens (Aladago et al., 2022), or cross-attention along with self-attention to combine intra- and inter-sequence dependencies (Meng et al., 3 Jun 2024).
- Information-theoretic perspectives: RTF (Guo et al., 21 Oct 2024) justifies its random token sampling procedure by appealing to mutual information: introducing fusion randomness increases entropy, counteracts overfitting, and enhances robustness for multi-view learning.
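For the norm-preserving operator mentioned above, a minimal illustrative variant is shown below: it averages token directions and rescales to the mean input norm. This captures the motivation behind MLERP (plain averaging shrinks vector norms, which distorts downstream attention logits) but is not the exact formulation of (Kim et al., 2023).

```python
import torch

def norm_preserving_merge(tokens: torch.Tensor) -> torch.Tensor:
    """Merge tokens by averaging directions, then restoring the mean norm.

    tokens: (K, D) group of tokens selected for fusion.
    Plain averaging lets opposing components cancel, so the merged vector
    is shorter than its inputs; rescaling the mean direction to the mean
    input norm avoids this. (Illustrative variant, not exact MLERP.)
    """
    mean = tokens.mean(dim=0)
    direction = mean / mean.norm().clamp(min=1e-8)
    target_norm = tokens.norm(dim=-1).mean()
    return direction * target_norm
```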
4. Empirical Benchmarks and Applications
Token-level fusion has demonstrated substantial gains across modalities and tasks:
| Task/Domain | Representative Fusion Type | Empirical Highlights |
|---|---|---|
| Vision (ImageNet-1K) | Layerwise, Late, Early | Layerwise fusion: 87.77% top-1, outperforms baselines |
| Multimodal vision (Taskonomy, NYUDv2) | Token substitution & alignment | Superior FID/KID and pixel accuracy/mIoU; new SOTA 3D detection |
| Vision-Language QA (VQA2.0) | Channel fusion (compound tokens) | Higher accuracy, lower memory than merged/co-attention |
| Drug-Target Interaction | Bilinear/cross-attention | AUROC 0.989, highlights fine-grained binding sites |
| Medical Imaging | Random token fusion | Improves AUC, reduces overfitting, no inference cost |
In addition, token-level fusion has proven critical for efficient transformers (ToFu (Kim et al., 2023), MCTF (Lee et al., 15 Mar 2024)), enabling up to a 44% FLOPs reduction while maintaining or slightly improving accuracy, and for boosting contextual ASR and OCR via late fusion of LLM and acoustic-model logits (Zhou et al., 16 Jul 2025; Hsu et al., 23 May 2024).
5. Advanced Fusion in Model and Distribution Alignment
Token-level fusion is actively researched for fusing knowledge across distinct models and architectures. Recent frameworks advance from hard alignment to structure-aware, differentiable criteria:
- Distributional Alignment via OT: PTA-LLM (Zeng et al., 21 Sep 2025) formulates token alignment as an optimal transport problem, using a cost matrix (e.g., edit distance) and dynamic programming for soft pairings. An iterative Sinkhorn algorithm solves for a probabilistic transport map, fusing logit distributions according to global structure (see the Sinkhorn sketch after this list).
- Graph-on-Logits Distillation: InfiGFusion (Wang et al., 20 May 2025) constructs a global “co-activation graph” over top-$k$ logits across sequence positions and aligns these graphs between source and target models via an efficient approximation of the Gromov-Wasserstein (GW) distance. The result is enhanced preservation of inter-token dependencies. Gains are particularly notable in tasks requiring relational reasoning (+35.6% on Multistep Arithmetic, +37.06% on Causal Judgement over SFT).
- Fusion for Differential Privacy: DP-Fusion (Thareja et al., 6 Jul 2025) achieves token-level differentially private inference by fusing LLM output distributions computed from “sensitive” and “public” context partitions. A parameterized mixture weight and a Rényi divergence bound ($D_\alpha$) are enforced at every decoding step, offering a tunable privacy/utility tradeoff well beyond baseline decoding approaches.
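To make the OT view concrete, here is a minimal NumPy Sinkhorn sketch of the kind of solver PTA-LLM describes: given a token-level cost matrix (e.g., edit distances between token strings) and marginal weights, it returns a soft transport map that can redistribute one model's probability mass onto another's vocabulary. The function name, fixed iteration count, and regularization value are assumptions.

```python
import numpy as np

def sinkhorn(cost: np.ndarray, a: np.ndarray, b: np.ndarray,
             eps: float = 0.1, n_iter: int = 200) -> np.ndarray:
    """Entropic-OT transport plan between two token distributions.

    cost: (n, m) alignment cost, e.g., edit distance between token strings
    a, b: marginal weights of source/target tokens (each sums to 1)
    Returns a soft (n, m) transport map T with T @ 1 = a and T.T @ 1 = b.
    """
    K = np.exp(-cost / eps)        # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)          # scale columns to match marginal b
        u = a / (K @ v)            # scale rows to match marginal a
    return u[:, None] * K * v[None, :]
```

The resulting map weights how source-model logits are spread over target-model tokens before the distributions are fused.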
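Similarly, a heavily simplified sketch of DP-Fusion-style token mixing: the private-context next-token distribution is interpolated toward the public-context one until a per-token Rényi divergence budget is met. The backoff schedule and budget handling here are illustrative assumptions; the actual mechanism calibrates the mixture weight far more carefully.

```python
import numpy as np

def renyi_divergence(p: np.ndarray, q: np.ndarray, alpha: float = 2.0) -> float:
    """D_alpha(P || Q) for discrete distributions; assumes q > 0 everywhere."""
    return np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0)

def dp_fuse(p_priv: np.ndarray, p_pub: np.ndarray,
            eps: float, alpha: float = 2.0) -> np.ndarray:
    """Blend private- and public-context token distributions.

    Shrinks the mixture weight lam toward the public distribution until
    the Renyi divergence from p_pub falls under the per-token budget eps.
    (Sketch only; the step size 0.05 is an arbitrary illustrative choice.)
    """
    lam, fused = 1.0, p_priv
    while renyi_divergence(fused, p_pub, alpha) > eps and lam > 0.0:
        lam = max(lam - 0.05, 0.0)             # back off toward public
        fused = lam * p_priv + (1.0 - lam) * p_pub
    return fused
```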
6. Challenges, Interpretability, and Research Directions
- Information Loss and Consistency: Careful design is needed to avoid over-pruning or premature fusion that would erase critical fine-grained details (Kim et al., 2023, Lee et al., 15 Mar 2024). Multi-criteria and “one-step-ahead” attention mechanisms are emerging as robust mitigations.
- Explainability and Semantic Control: Fuzzy-membership-based token-level semantic fusion (Huang et al., 14 Sep 2025) and learned fusion tokens for deep multimodal fusion (Georgiou et al., 15 Apr 2025) demonstrate how interpretable or user-controllable attributes can be propagated alongside standard embeddings, improving both perplexity and controllability in generation.
- Cross-domain Generality: The plug-and-play nature of several fusion frameworks (e.g., ToFu (Pippi et al., 6 Mar 2025), MCTF (Lee et al., 15 Mar 2024), Channel fusion (Aladago et al., 2022)) means they can generalize to new architectures and modalities without retraining, broadening their application scope.
- Adaptive and Efficient Implementation: Adaptive thresholding, fusion schedule optimization, hardware-aware design, and end-to-end differentiable selection mechanisms remain open areas for efficiency and performance.
- Unifying Model Fusion Theory: Recent theory (Xu et al., 3 Jun 2024) establishes formal grounds (using information-theoretic divergence and representation similarity) to predict, at the token level, whether external knowledge provides net benefit or harm, informing practical collaborative generation and harmonization approaches.
7. Summary Table: Core Token-Level Fusion Paradigms
| Paradigm | Mechanism | Representative Paper | Domain |
|---|---|---|---|
| Dynamic Substitution | Score-driven token replacement | (Wang et al., 2022) | Multimodal Vision |
| Channel Concatenation | Cross-attention channel stacking | (Aladago et al., 2022) | Vision-Language |
| Multi-criteria Merging | Similarity, informativeness, size | (Lee et al., 15 Mar 2024; Guo et al., 21 Oct 2024) | Vision, Medical Imaging |
| Model Alignment | Optimal transport, graph distillation | (Zeng et al., 21 Sep 2025; Wang et al., 20 May 2025) | Model Fusion for LLMs |
| Fuzzy Semantic Side-Channel | Gated parallel semantic vector | (Huang et al., 14 Sep 2025) | LLM Language Modeling |
| Late/Layered Integration | Flexible fusion scheduling | (Choi et al., 2022; Shen et al., 15 Sep 2024) | Vision |
| Privacy Preserving | Rényi-bounded token mixing | (Thareja et al., 6 Jul 2025) | Privacy, Paraphrasing |
These paradigms collectively illustrate the diversity and technical depth of token-level fusion, with growing sophistication in dynamic selection, alignment, interpretability, and efficiency. Ongoing research continues to refine these approaches for broader applicability and scalability across modalities and real-world tasks.