Hybrid Transformer Multi-level Fusion

Updated 25 February 2026

Hybrid Transformer with Multi-level Fusion is a neural architecture that integrates global self-attention with layer-wise multimodal fusion to improve discriminative learning.
It employs prompt injection at each transformer layer to blend modality-specific cues and mitigate cross-modal noise while combining supervised and unsupervised contrastive losses.
Empirical results show improved performance in tasks such as emotion recognition and video understanding, with gains of 1–4 percentage points over traditional fusion methods.

A hybrid transformer with multi-level fusion is a neural architecture that interleaves transformer-based global self-attention with hierarchical information fusion across multiple network depths, typically blending both strong and weak modalities or distinct semantic scales. In the context of multimodal tasks such as emotion recognition, medical imaging, video understanding, or knowledge graph completion, these models harness the complementary strengths of each modality (e.g., text, audio, vision) and fuse them in a way that both preserves local detail and leverages long-range dependencies. Multi-level fusion schemes inject modality-specific information at several transformer layers, rather than relying solely on early or late fusion; this design mitigates cross-modal noise and supports discriminative representation learning. Hybrid training objectives, especially those including both supervised and unsupervised contrastive losses, further promote robustness and discrimination, particularly under class imbalance or rare-category regimes (Zou et al., 2023).

1. Core Principles and Model Architecture

The hybrid transformer with multi-level fusion operates via the following sequence:

Input Encoding: Each modality (strong and weak) is encoded independently. For example, text $T_0\in\mathbb{R}^{n\times d}$ may be extracted from BERT, audio $A\in\mathbb{R}^{m\times d}$ and video $V\in\mathbb{R}^{k\times d}$ from respective encoders.
Deep Cue Extraction: Stronger modalities (audio, video) are passed through "emotion-cue" MLPs to obtain lower-dimensional cue vectors $C_A, C_V$.
Prompt Generation: Fused mean representations from $C_A$ and $C_V$ (e.g., using pooling and learned affine projection) become prompt vectors $P_0\in\mathbb{R}^{l\times p}$ .
Layer-wise Prompt Injection: At each of $L$ transformer layers, prompts $P_{\ell-1}$ are concatenated with the text features $T_{\ell-1}$ in the attention computation, thus biasing self-attention towards distilled multimodal cues at every layer rather than just once.
Prompt and Feature Updates: After each attention block, both the text features and prompts are updated by residual, FFN, and a prompt-specific update block.
Multi-level Fusion: Features $T_1,\ldots,T_L$ and prompts $P_1,\ldots,P_L$ are aggregated via concatenation or learnable weighted sum to form a final representation $H_{\text{final}}$ .
Hybrid Loss: Two contrastive objectives are applied to $H_{\text{final}}$ (an unsupervised InfoNCE loss between augmentations and a supervised contrastive loss over class labels) and merged via a balancing hyperparameter (Zou et al., 2023).
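The first three steps (encoding, cue extraction, prompt generation) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the two-layer cue MLP, the concatenate-then-project fusion, the helper names `cue_mlp` and `make_prompts`, and all sizes are assumptions, and random arrays stand in for real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def cue_mlp(x, w1, b1, w2, b2):
    """Two-layer ReLU MLP mapping encoder features to lower-dimensional cues."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def make_prompts(c_a, c_v, w_p, b_p, num_prompts):
    """Mean-pool each cue sequence, concatenate, and project to prompt tokens."""
    pooled = np.concatenate([c_a.mean(axis=0), c_v.mean(axis=0)])  # (2c,)
    flat = pooled @ w_p + b_p                                       # (l * p,)
    return flat.reshape(num_prompts, -1)                            # (l, p)

# Hypothetical sizes: m audio frames, k video frames, encoder width d,
# hidden width h, cue width c, and l prompt tokens of width p.
m, k, d, h, c, l, p = 7, 5, 16, 12, 8, 4, 16

A = rng.normal(size=(m, d))  # stand-in for audio encoder output
V = rng.normal(size=(k, d))  # stand-in for video encoder output

def init_mlp():
    """Randomly initialised (w1, b1, w2, b2) for one cue MLP."""
    return (rng.normal(size=(d, h)), np.zeros(h),
            rng.normal(size=(h, c)), np.zeros(c))

C_A = cue_mlp(A, *init_mlp())
C_V = cue_mlp(V, *init_mlp())

W_p = 0.1 * rng.normal(size=(2 * c, l * p))
b_p = np.zeros(l * p)
P_0 = make_prompts(C_A, C_V, W_p, b_p, num_prompts=l)

print(C_A.shape, C_V.shape, P_0.shape)  # (7, 8) (5, 8) (4, 16)
```

Here the prompt width `p` is set equal to the text width so that the prompts can later be concatenated with the text features; that equality is likewise an assumption of the sketch.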

2. Mathematical Foundations and Layer Operations

The central innovation is the prompt-augmented, multi-level fusion at each transformer layer. For a single layer $\ell$ :

Attention with Prompts:

$Q = T_{\ell-1} W_Q,\quad K = [T_{\ell-1}; P_{\ell-1}] W_K,\quad V = [T_{\ell-1}; P_{\ell-1}] W_V$

where $[;]$ denotes concatenation.

Prompt-aware Bias:

$\text{scores} = \frac{Q K^T}{\sqrt{d}} + f(P_{\ell-1})$

where $f(\cdot)$ maps prompts to a per-head bias.
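A minimal single-head NumPy sketch of the prompt-augmented attention step is given here for illustration. The exact form of $f$ is not specified above, so the sketch assumes one simple choice: a learned vector `w_bias` produces one scalar bias per prompt token, and text keys receive zero bias. The prompt width is also assumed to equal the text width $d$ so that concatenation is well defined.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prompt_attention(T, P, W_Q, W_K, W_V, w_bias):
    """Text queries attend over the concatenation of text and prompt tokens."""
    n = T.shape[0]
    TP = np.concatenate([T, P], axis=0)           # (n + l, d)
    Q, K, V = T @ W_Q, TP @ W_K, TP @ W_V
    d = Q.shape[-1]
    # Assumed f(P): scalar bias per prompt key, zero bias for text keys.
    bias = np.concatenate([np.zeros(n), P @ w_bias])  # (n + l,)
    scores = Q @ K.T / np.sqrt(d) + bias          # bias broadcasts over query rows
    attn = softmax(scores, axis=-1)               # (n, n + l)
    return attn @ V, attn

rng = np.random.default_rng(0)
n, l, d = 6, 3, 8                                 # hypothetical sizes
T = rng.normal(size=(n, d))
P = rng.normal(size=(l, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
w_bias = rng.normal(size=d)

out, attn = prompt_attention(T, P, W_Q, W_K, W_V, w_bias)
print(out.shape, attn.shape)  # (6, 8) (6, 9)
```

Because only the text tokens issue queries, the output keeps the text sequence length $n$, while each row of the attention matrix spans all $n + l$ keys.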

Update:

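$A = \text{softmax}(\text{scores}),\quad T_\ell = \text{LayerNorm}\left(T_{\ell-1} + \text{Dropout}(A \cdot V)\right)$

where $A$ denotes the attention weights and $V$ the value matrix defined above, not the audio and video features of Section 1.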

Prompts are updated by a small MLP:

$P_\ell = \text{LayerNorm}\left(P_{\ell-1} + \text{ReLU}(P_{\ell-1} W_{u1} + b_{u1}) W_{u2} + b_{u2}\right)$

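Multi-level Aggregation: After all $L$ layers, the final representation for classification is formed in one of two ways: a linear projection of concatenated features (first equation below) or a learnable convex combination across layers (second equation):

$H_{\text{final}} = \text{concat}(T_L,\, \text{mean}_\ell\, T_\ell,\, \text{mean}_\ell\, P_\ell)\cdot W_f + b_f$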

$H_{\text{final}} = \sum_{\ell=1}^L \alpha_\ell T_\ell + \sum_{\ell=1}^L \beta_\ell P_\ell, \quad \sum \alpha_\ell = \sum \beta_\ell = 1$
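Both aggregation variants can be sketched in NumPy. Since $T_\ell$ and $P_\ell$ generally have different numbers of tokens, the sketch assumes each layer output is first mean-pooled over tokens to a single $d$-dimensional vector; the text above does not state that pooling step. The constraint $\sum \alpha_\ell = \sum \beta_\ell = 1$ is enforced here by a softmax over free logits, which is one common choice and also an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pool(layers):
    """Mean-pool tokens per layer: list of (tokens, d) arrays -> (L, d)."""
    return np.stack([x.mean(axis=0) for x in layers])

def aggregate_concat(T_layers, P_layers, W_f, b_f):
    """Variant 1: project concat(T_L, mean over layers of T, mean over layers of P)."""
    t, p = pool(T_layers), pool(P_layers)
    feats = np.concatenate([t[-1], t.mean(axis=0), p.mean(axis=0)])  # (3d,)
    return feats @ W_f + b_f

def aggregate_weighted(T_layers, P_layers, a_logits, b_logits):
    """Variant 2: convex combination of layers with softmax-normalised weights."""
    alpha, beta = softmax(a_logits), softmax(b_logits)  # each sums to 1
    return alpha @ pool(T_layers) + beta @ pool(P_layers)

rng = np.random.default_rng(0)
L, n, l, d, d_out = 3, 6, 4, 8, 5                 # hypothetical sizes
T_layers = [rng.normal(size=(n, d)) for _ in range(L)]
P_layers = [rng.normal(size=(l, d)) for _ in range(L)]

W_f, b_f = rng.normal(size=(3 * d, d_out)), np.zeros(d_out)
H_concat = aggregate_concat(T_layers, P_layers, W_f, b_f)
H_weighted = aggregate_weighted(T_layers, P_layers, np.zeros(L), np.zeros(L))

print(H_concat.shape, H_weighted.shape)  # (5,) (8,)
```

With all logits equal, the weighted variant reduces to the sum of the layer-averaged text and prompt vectors; training the logits lets the model emphasise particular depths.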

3. Fusion Strategies and Multi-Level Variants

Hybrid transformers with multi-level fusion are not limited to prompt-based mechanisms, but the unifying trait is the repeated interleaving of fusion operations within a deep transformer (as opposed to fixed early or late fusion). Examples include:

Prompt-Injection Layers: As in the Multimodal Prompt Transformer, prompts derived from strong modalities are re-injected at each transformer layer (Zou et al., 2023).
Hierarchical Cross-Modal Attention: Other models (e.g., knowledge graph completion (Chen et al., 2022), medical imaging (Cho et al., 2023)) use coarser prefix-guided or correlation-aware fusion modules at every transformer or network depth.
Adaptive Residual and Gating: Multi-level fusions can aggregate intermediate block outputs (e.g., via learned gates, adaptive residuals), aligning with the architectural patterns in vision, medical, or speech models (EL-Assiouti et al., 2024, Chen et al., 2022).

Fusion Paradigm	Example Model	Fusion Points
Prompt-based per layer	Multimodal Prompt Transformer (Zou et al., 2023)	Every transformer layer
Prefix-guided & correlation	MKGformer (Chen et al., 2022)	M-Encoder final L layers
Residual hierarchical (skip)	HFTrans (Cho et al., 2023)	Early, mid, late blocks
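To make the gating idea concrete, the following sketch shows a generic learned gate that folds intermediate block outputs into a running fused representation. It illustrates the general pattern only and is not taken from any of the cited models; the names `gated_fuse` and `fuse_levels`, the shared gate parameters, and the shapes are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(x, y, W_g, b_g):
    """Elementwise gate g in (0, 1) mixes two same-shaped feature arrays."""
    g = sigmoid(np.concatenate([x, y], axis=-1) @ W_g + b_g)
    return g * x + (1.0 - g) * y

def fuse_levels(block_outputs, W_g, b_g):
    """Fold outputs from successive depths into one fused representation."""
    fused = block_outputs[0]
    for h in block_outputs[1:]:
        fused = gated_fuse(fused, h, W_g, b_g)
    return fused

rng = np.random.default_rng(0)
n, d = 6, 8                                       # hypothetical sizes
blocks = [rng.normal(size=(n, d)) for _ in range(3)]  # early, mid, late outputs
W_g = 0.1 * rng.normal(size=(2 * d, d))
b_g = np.zeros(d)

fused = fuse_levels(blocks, W_g, b_g)
print(fused.shape)  # (6, 8)
```

Because the gate is a convex mix, every fused value lies between the corresponding values of its two inputs, so no single depth can dominate unless the gate saturates.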

4. Hybrid Training Objectives and Robustness

Hybrid transformers with multi-level fusion typically combine standard supervised cross-entropy with supervised and unsupervised contrastive losses:

Unsupervised InfoNCE: Encourages global instance discrimination via random augmentations, shaping representations to separate different instances.

$\mathcal{L}_{\text{unsup}} = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(\text{sim}(z_i, z_{i^+})/\tau)}{ \sum_{j=1}^{2N} \mathbb{1}_{j \neq i} \exp(\text{sim}(z_i, z_j)/\tau)}$

Supervised Contrastive: Clusters representations within the same emotion or semantic class, mining information from few-sample classes.

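$\mathcal{L}_{\text{sup}} = -\frac{1}{N} \sum_{i=1}^N \frac{1}{|P(i)|} \sum_{p\in P(i)} \log \frac{\exp(\text{sim}(z_i, z_p)/\tau)}{\sum_{a=1}^{2N} \exp(\text{sim}(z_i, z_a)/\tau)}$

where $P(i)$ is the set of other samples in the batch that share the label of anchor $i$, $\text{sim}(\cdot,\cdot)$ is a similarity function (typically cosine similarity), and $\tau$ is a temperature.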

Hybrid Loss: Weighted sum of both,

$\mathcal{L}_{\text{HCL}} = \alpha \mathcal{L}_{\text{unsup}} + (1-\alpha) \mathcal{L}_{\text{sup}}$
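The hybrid objective can be sketched in NumPy as below, under several stated assumptions: embeddings are stacked so that rows $i$ and $i+N$ are two augmented views of instance $i$, similarity is cosine similarity, both losses average over all $2N$ views, and the anchor itself is excluded from both denominators (the usual supervised contrastive convention, which differs slightly from the supervised formula as written above). Function names are hypothetical.

```python
import numpy as np

def _log_prob(z, tau):
    """Row-wise log-softmax of cosine similarities / tau, excluding self-pairs."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                # drop j == i from the denominator
    m = sim.max(axis=1, keepdims=True)
    return sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))

def info_nce(z, tau=0.1):
    """Unsupervised InfoNCE: the positive of view i is the other view of the same instance."""
    two_n = z.shape[0]
    n = two_n // 2
    pos = np.concatenate([np.arange(n, two_n), np.arange(n)])
    return -_log_prob(z, tau)[np.arange(two_n), pos].mean()

def sup_con(z, labels, tau=0.1):
    """Supervised contrastive loss: positives are all other views with the same label."""
    logp = _log_prob(z, tau)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)                 # an anchor is not its own positive
    counts = same.sum(axis=1)                     # |P(i)|
    per_anchor = np.where(same, logp, 0.0).sum(axis=1) / np.maximum(counts, 1)
    return -per_anchor[counts > 0].mean()

def hybrid_loss(z, labels, alpha=0.5, tau=0.1):
    """alpha * L_unsup + (1 - alpha) * L_sup."""
    return alpha * info_nce(z, tau) + (1.0 - alpha) * sup_con(z, labels, tau)

rng = np.random.default_rng(0)
N, d = 4, 8                                       # hypothetical batch and width
base = rng.normal(size=(N, d))
z = np.concatenate([base + 0.05 * rng.normal(size=(N, d)),
                    base + 0.05 * rng.normal(size=(N, d))])  # two noisy views
labels = np.tile(np.array([0, 0, 1, 1]), 2)

loss = hybrid_loss(z, labels, alpha=0.5)
print(float(loss))
```

When every instance has a unique label, the only positive for each anchor is its other view, so the supervised term coincides with InfoNCE; the two terms diverge as classes contain more members.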

This balanced objective improves rare-class discrimination and enables robust modality fusion, as confirmed empirically in emotion recognition experiments and ablation studies (Zou et al., 2023).

5. Empirical Results and Design Rationale

Models employing hybrid transformers with multi-level fusion consistently surpass single-modality and shallow-fusion baselines across diverse tasks:

Emotion Classification: Outperforms previous SOTA on benchmark ERC datasets by guiding textual features with strong audio/visual cues at every layer (Zou et al., 2023).
Speech, Video, and Knowledge Graphs: Demonstrated systematic improvements in automatic speech recognition, text-video retrieval, and multimodal knowledge graph completion via repeated fusion (Lohrenz et al., 2021, Liu et al., 2022, Chen et al., 2022).
Ablation Findings: Layer-wise fusion at multiple depths confers higher generalization and discriminative power than “one-shot” fusion before or after transformer processing:
- PGI (prefix-guided fusion) contributes 1–3 percentage points across multimodal tasks.
- Correlation-aware fine-grained fusion adds a further 1–4 point gain (Chen et al., 2022).

The primary rationale is that repeated, deep interleaving of cross-modal fusion allows both weak modalities (e.g., text in noisy emotion recognition) and strong ones (audio/video) to influence the transformation of representations throughout the network’s hierarchy, fundamentally improving signal alignment and noise suppression.

6. Representative Pseudocode and Layer Equations

A canonical Multimodal Prompt Transformer (MPT) layer for hybrid fusion can be expressed as:

def MPTLayer(T_prev, P_prev):
    # W_Q, W_K, W_V, W_u1, b_u1, W_u2, b_u2 and f are this layer's learned parameters
    TP = concat(T_prev, P_prev, axis=0)             # stack text and prompt tokens
    Q = T_prev @ W_Q                                # text queries
    K = TP @ W_K                                    # text+prompt keys
    V = TP @ W_V                                    # text+prompt values
    scores = (Q @ K.T) / sqrt(d) + f(P_prev)        # prompt-aware bias
    A = softmax(scores, axis=-1)                    # normalize over keys
    T_hat = A @ V
    T_out = LayerNorm(T_prev + Dropout(T_hat))      # attention residual
    T_out = LayerNorm(T_out + Dropout(FFN(T_out)))  # feed-forward residual
    P_update = ReLU(P_prev @ W_u1 + b_u1) @ W_u2 + b_u2  # prompt MLP
    P_out = LayerNorm(P_prev + P_update)
    return T_out, P_out

After $L$ layers, the outputs $\{T_\ell\}, \{P_\ell\}$ are aggregated as described in Section 2. This facilitates stepwise multi-modal information refinement (Zou et al., 2023).
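For readers who want to execute the layer, a self-contained NumPy rendering of the pseudocode is sketched here. It makes several simplifying assumptions that are not part of the description above: a single attention head, inference mode (dropout omitted), layer normalisation without learned gain or bias, a bias-free two-layer ReLU FFN, prompts of the same width as the text features, and a bias function $f$ that assigns one learned scalar per prompt token. All parameter names in `init_layer` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_layer(rng, d, d_ff):
    """Randomly initialised parameters for one layer."""
    s = 1.0 / np.sqrt(d)
    mat = lambda r, c: rng.normal(scale=s, size=(r, c))
    return {
        "W_Q": mat(d, d), "W_K": mat(d, d), "W_V": mat(d, d),
        "w_f": rng.normal(scale=s, size=d),       # assumed form of f
        "W_1": mat(d, d_ff), "W_2": mat(d_ff, d),
        "W_u1": mat(d, d), "b_u1": np.zeros(d),
        "W_u2": mat(d, d), "b_u2": np.zeros(d),
    }

def mpt_layer(T_prev, P_prev, w):
    n, d = T_prev.shape
    TP = np.concatenate([T_prev, P_prev], axis=0)          # text + prompt tokens
    Q, K, V = T_prev @ w["W_Q"], TP @ w["W_K"], TP @ w["W_V"]
    bias = np.concatenate([np.zeros(n), P_prev @ w["w_f"]])  # zero on text keys
    A = softmax(Q @ K.T / np.sqrt(d) + bias)               # (n, n + l)
    T_out = layer_norm(T_prev + A @ V)                     # attention residual
    ffn = np.maximum(T_out @ w["W_1"], 0.0) @ w["W_2"]
    T_out = layer_norm(T_out + ffn)                        # feed-forward residual
    P_update = np.maximum(P_prev @ w["W_u1"] + w["b_u1"], 0.0) @ w["W_u2"] + w["b_u2"]
    P_out = layer_norm(P_prev + P_update)                  # prompt update
    return T_out, P_out

def run_stack(T0, P0, layers):
    """Apply the layers in order, keeping every intermediate output."""
    T, P, Ts, Ps = T0, P0, [], []
    for w in layers:
        T, P = mpt_layer(T, P, w)
        Ts.append(T)
        Ps.append(P)
    return Ts, Ps

rng = np.random.default_rng(0)
n, l, d, d_ff, L = 6, 3, 8, 16, 4                          # hypothetical sizes
layers = [init_layer(rng, d, d_ff) for _ in range(L)]
Ts, Ps = run_stack(rng.normal(size=(n, d)), rng.normal(size=(l, d)), layers)

print(len(Ts), Ts[-1].shape, Ps[-1].shape)  # 4 (6, 8) (3, 8)
```

Keeping the per-layer lists `Ts` and `Ps` rather than only the final outputs is what makes multi-level aggregation possible afterwards.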

7. Broader Impact and Ongoing Directions

The hybrid transformer with multi-level fusion paradigm provides a rigorous, extensible framework for multimodal and hierarchical learning. Ongoing research extends these concepts to:

Uncertainty-aware and competitive expert mixtures (Jinfu et al., 27 Jul 2025)
Dynamic, data-driven modality weighting (Chen et al., 2022, Chen et al., 2022)
Large-scale tasks in computer vision, speech recognition, and recommendation systems

A plausible implication is that further advances in controllable, dynamic, or prompt-driven fusion may expand both the robustness and interpretability of deep multimodal architectures in open-world settings.