InterFusion Module: Cross-Modal Fusion
- InterFusion modules are advanced architectural designs that integrate heterogeneous inputs using learnable gating, attention, or shared feature spaces.
- They are applied across domains such as neuroscience, medical imaging, emotion recognition, and audio to achieve robust, parameter-efficient performance.
- These modules enable adaptive, context-aware fusion that improves predictions and restoration by dynamically balancing multi-modal information.
The InterFusion module refers to a family of architectural and algorithmic designs for cross-modal fusion appearing across diverse domains, notably neuroscience, computer vision, medical imaging, speech processing, and generative modeling. These modules systematically integrate information from multiple modalities or representations using learnable, context-adaptive mechanisms—commonly involving gating functions, attention, or shared feature spaces—rather than static concatenation or naive aggregation. While implementations differ, the defining principle is the explicit, often parameter-efficient, fusion of heterogeneous inputs to yield domain-specific improvements in prediction, restoration, or generation.
1. General Principles and Mathematical Formulation
InterFusion modules typically operate by first embedding each unimodal input into a shared or aligned feature space, followed by merging them through a parameterized operator. The most common approach utilizes gating, where a learnable function determines the elementwise contribution of each modality to the fused representation.
Consider two modality embeddings $x_a, x_b \in \mathbb{R}^{B \times T \times d}$ (batch, spatial/temporal index, feature dimension). The canonical fusion operation is

$$g = \sigma\!\left(W_g\,[x_a \,\|\, x_b] + b_g\right), \qquad z = g \odot x_a + (1 - g) \odot x_b,$$

where $[\cdot \,\|\, \cdot]$ denotes concatenation along the feature dimension, $\odot$ denotes element-wise multiplication, and $\sigma$ is a sigmoid (or similar) squashing function. Alternatives include bidirectional gated residual mixing, as in emotion recognition, or hybrid self-attention as used in transformer-based fusion of vision and text tokens.
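A minimal PyTorch sketch of this canonical gated fusion is given below. It is illustrative only: the layer sizes, the sigmoid gate, and the convex mixing are assumptions drawn from the generic formulation above, not the exact implementation of any cited system.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse two modality embeddings with a learned element-wise gate."""
    def __init__(self, dim: int):
        super().__init__()
        # The gate is computed from the concatenated embeddings.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a, x_b: (batch, tokens, dim)
        g = torch.sigmoid(self.gate(torch.cat([x_a, x_b], dim=-1)))
        return g * x_a + (1.0 - g) * x_b  # convex, per-element combination

# Usage
fusion = GatedFusion(dim=64)
z = fusion(torch.randn(2, 10, 64), torch.randn(2, 10, 64))  # -> (2, 10, 64)
```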
Modules may further stack multiple such fusions with intermediate transformers, normalization, or recurrence, optionally employing LoRA-like parameter-efficient adapters for robust adaptation to data distribution shifts or degradation types.
2. Domain-Specific Instantiations
Neuroimaging: BrainSymphony Adaptive Fusion Gate
In BrainSymphony, functional and structural brain data—specifically, fMRI and diffusion MRI-derived structural connectomes—are fused via the adaptive fusion gate. After projecting both modality embeddings to a common latent size, a linear gate produces a soft mask governing the relative importance of each modality at each brain region and feature dimension:
- ,
- Project to ; compute gate via ; output No normalization or dropout is used within the fusion gate; regularization is managed elsewhere in the model (Khajehnejad et al., 23 Jun 2025).
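A sketch of such a per-ROI adaptive gate, written as a plausible reading of the description above (the projection layers and sigmoid gate are assumptions; this is not the released BrainSymphony code):

```python
import torch
import torch.nn as nn

class AdaptiveFusionGate(nn.Module):
    """Per-ROI, per-feature gating of functional vs. structural embeddings."""
    def __init__(self, d_func: int, d_struct: int, d_latent: int):
        super().__init__()
        self.proj_f = nn.Linear(d_func, d_latent)      # project fMRI embedding
        self.proj_s = nn.Linear(d_struct, d_latent)    # project structural embedding
        self.gate = nn.Linear(2 * d_latent, d_latent)  # no norm/dropout inside the gate

    def forward(self, x_f: torch.Tensor, x_s: torch.Tensor) -> torch.Tensor:
        # x_f: (batch, n_rois, d_func), x_s: (batch, n_rois, d_struct)
        h_f, h_s = self.proj_f(x_f), self.proj_s(x_s)
        g = torch.sigmoid(self.gate(torch.cat([h_f, h_s], dim=-1)))
        return g * h_f + (1.0 - g) * h_s  # soft mask per region and feature
```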
Multimodal Emotion Recognition: Bidirectional Gated Residual InterFusion
On the iMiGUE dataset, the InterFusion module fuses per-chunk facial and visual-context embeddings. After linear projection and a per-stream transformer encoding, a gating block computes $g_{a \to b} = \sigma\!\left(W\,[x_a \,\|\, x_b]\right)$ (and, symmetrically, $g_{b \to a}$), then fuses both directions with gated residual updates:

$$\tilde{x}_b = \mathrm{LayerNorm}\!\left(x_b + g_{a \to b} \odot x_a\right), \qquad \tilde{x}_a = \mathrm{LayerNorm}\!\left(x_a + g_{b \to a} \odot x_b\right),$$

with LayerNorm applied after the residual addition. This process recurs at multiple points in the pipeline, enabling deep iterative cross-modal alignment. The block is computationally efficient and enables hierarchical refinement (Martirosyan et al., 29 Dec 2025).
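An illustrative sketch of a bidirectional gated residual fusion block of this kind (separate gates per direction and post-residual LayerNorm are assumptions consistent with the description above, not the paper's exact code):

```python
import torch
import torch.nn as nn

class BiGatedResidualFusion(nn.Module):
    """Bidirectional gated residual mixing of two token streams (sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate_ab = nn.Linear(2 * dim, dim)  # gate for the a -> b direction
        self.gate_ba = nn.Linear(2 * dim, dim)  # gate for the b -> a direction
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        cat = torch.cat([x_a, x_b], dim=-1)
        g_ab = torch.sigmoid(self.gate_ab(cat))
        g_ba = torch.sigmoid(self.gate_ba(cat))
        # Gated residual updates; LayerNorm applied after the residual addition.
        y_b = self.norm_b(x_b + g_ab * x_a)
        y_a = self.norm_a(x_a + g_ba * x_b)
        return y_a, y_b
```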
Medical Imaging: UFR-RF with ALSN Skip-Path InterFusion
In medical image restoration and fusion, the Universal Feature Restoration & Fusion (UFR-RF) block fuses features from degraded and reference images at multiple U-Net scales. Each skip connection incorporates an Adaptive LoRA Synergistic Network (ALSN), combining a main path with multiple low-rank, degradation-aware LoRA branches:

$$y = W_0 x + \sum_{k} \alpha_k\, B_k A_k\, x,$$

where $W_0$ is the main-path weight, $B_k A_k$ are low-rank branches, and $\alpha_k$ are degradation-class weights. This single-stage fusion paradigm contrasts with classical three-stage cascades by jointly optimizing alignment, restoration, and fusion, reducing parameter count and error accumulation (Su et al., 28 Jun 2025).
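A minimal sketch of a degradation-weighted LoRA-branch mixer matching the equation above, assuming linear branches, a fixed rank, and per-sample class weights supplied externally (rank, branch count, and layer types are illustrative, not the paper's settings):

```python
import torch
import torch.nn as nn

class LoRABranchMixer(nn.Module):
    """Main path plus degradation-weighted low-rank (LoRA-style) branches (sketch)."""
    def __init__(self, dim: int, rank: int = 4, n_branches: int = 3):
        super().__init__()
        self.main = nn.Linear(dim, dim)  # shared main path W0
        self.down = nn.ModuleList([nn.Linear(dim, rank, bias=False) for _ in range(n_branches)])
        self.up = nn.ModuleList([nn.Linear(rank, dim, bias=False) for _ in range(n_branches)])

    def forward(self, x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); alpha: (batch, n_branches) degradation-class weights
        y = self.main(x)
        for k, (down, up) in enumerate(zip(self.down, self.up)):
            y = y + alpha[:, k, None, None] * up(down(x))  # low-rank branch k
        return y
```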
Vision-Language Fusion: DiTFuse Transformer-Stack InterFusion
In DiTFuse, "InterFusion" denotes a stack of DiT transformer blocks jointly attending to two images and instruction text. Text and visual tokens are concatenated into a shared sequence, and hybrid self-attention performs cross-modal aggregation governed by causal and bidirectional masks:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right)V,$$

where $M$ encodes the causal/bidirectional masking pattern over the joint token sequence. No GAN or perceptual loss is used; all supervision is via flow-matching objectives constructed from multi-degradation masked image modeling. Task and sub-task control is achieved by jointly conditioning on instruction tokens at every block, yielding fine-grained fusion and controllability (Li et al., 8 Dec 2025).
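An illustrative sketch of masked self-attention over a joint text-plus-image token sequence. The mask layout, token counts, and the function name `hybrid_masked_attention` are assumptions chosen for exposition; they do not reproduce DiTFuse's actual masking scheme.

```python
import torch
import torch.nn.functional as F

def hybrid_masked_attention(q, k, v, mask):
    """Scaled dot-product attention over a joint (text + image) token sequence.

    mask: additive mask of shape (seq, seq), 0 where attention is allowed and
    -inf where it is blocked (e.g. causal within text, bidirectional elsewhere).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5 + mask
    return F.softmax(scores, dim=-1) @ v

# Toy example: a causal block within the text prefix, unrestricted elsewhere.
n_text, n_img = 4, 8
n = n_text + n_img
mask = torch.zeros(n, n)
mask[:n_text, :n_text] = torch.triu(
    torch.full((n_text, n_text), float("-inf")), diagonal=1
)

q = k = v = torch.randn(1, n, 32)
out = hybrid_masked_attention(q, k, v, mask)  # -> (1, 12, 32)
```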
Audio: Multi-Scale Interfusion for Speaker Extraction
MC-SpEx utilizes “ScaleFuser” blocks that share weights across mixture and reference encoders. Multiple 1-D convolutional streams (small, mid, large window) are stacked and processed with weight-shared Conv2d+ELU layers, yielding consistent latent spaces for downstream fusion. This is paired with a ScaleInterMG mask generator for joint mask prediction over multi-scale features. Performance gains are directly linked to weight sharing and feature-space alignment across scales and modules (Chen et al., 2023).
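A sketch of a multi-scale, weight-shared encoder in this spirit, under stated assumptions: the window sizes, stride, and single Conv2d+ELU fuser are illustrative, and "weight sharing" would mean reusing the same module instance for both the mixture and reference inputs, not the exact MC-SpEx architecture.

```python
import torch
import torch.nn as nn

class MultiScaleSharedEncoder(nn.Module):
    """Multi-scale 1-D conv streams followed by a Conv2d+ELU fuser (sketch)."""
    def __init__(self, n_filters: int = 64, windows=(20, 80, 160)):
        super().__init__()
        # One 1-D conv stream per window size; a common stride keeps outputs aligned in time.
        self.streams = nn.ModuleList([
            nn.Conv1d(1, n_filters, kernel_size=w, stride=windows[0] // 2, padding=w // 2)
            for w in windows
        ])
        # Reuse one instance of this encoder for mixture and reference to share weights.
        self.fuser = nn.Sequential(
            nn.Conv2d(len(windows), 1, kernel_size=3, padding=1), nn.ELU()
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples)
        feats = [s(wav) for s in self.streams]
        t = min(f.size(-1) for f in feats)                      # align stream lengths
        stacked = torch.stack([f[..., :t] for f in feats], 1)   # (batch, n_scales, filters, time)
        return self.fuser(stacked).squeeze(1)                   # (batch, filters, time)
```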
3. Functional and Computational Properties
| Domain | InterFusion Mechanism | Key Benefits |
|---|---|---|
| Neuroimaging | Per-element adaptive gating | Per-ROI weighting, improved regression and classification |
| Multimodal recognition | Bidirectional residual gated fusion | Symmetric exchange, parameter efficiency |
| Med. image fusion | Multi-scale ALSN-based skip fusion | Single-stage, robust to degradation |
| Vision-language | Joint transformer cross-attention | Unified semantics, fine-grained control |
| Audio | Weight-shared multi-scale fusers | Consistency, superior multi-scale extraction |
Across settings, InterFusion modules eschew naive concatenation in favor of parameterized, content-adaptive operations. This yields robustness against input degradation or modality imbalance, direct controllability, and superior alignment.
4. Empirical Performance and Ablation Studies
Quantitative gains from InterFusion modules are consistently demonstrated across domains:
- Neuroimaging: Gated fusion lowers age-prediction MSE by roughly 20% or more versus unimodal or naive fusion, and improves classification accuracy by up to 9\% (Khajehnejad et al., 23 Jun 2025).
- Emotion Recognition: Dual-stream InterFusion models outperform their baselines by clear absolute margins, and cross-modal token-fusion ablations report consistent improvements from the gated fusion block (Martirosyan et al., 29 Dec 2025).
- Medical Imaging: Single-stage UFR-RF with ALSN substantially reduces parameter counts, cuts FLOPs by orders of magnitude, and yields sharper, less artifact-prone results than cascaded approaches (Su et al., 28 Jun 2025).
- Diffusion Transformers: Task tags and multi-degradation masked-image modeling in DiTFuse’s InterFusion core deliver state-of-the-art metrics on IVIF and MFF, with ablation highlighting the necessity of instruction conditioning and diversity in degradation types (Li et al., 8 Dec 2025).
- Speaker Extraction: ScaleFuser and ScaleInterMG modules jointly contribute more than $1$ dB SI-SDR improvement over previous SOTA, confirming the impact of multi-scale, tightly-coupled fusion (Chen et al., 2023).
A common finding is that simple concatenation or equal-weight averaging underperforms substantially, while learned, context-driven InterFusion modules adaptively balance disparate sources.
5. Extensions and Open Directions
InterFusion modules are extensible along several axes:
- Flexible Gates: Replace affine gates with convolutional or recurrent functions to incorporate local context (Martirosyan et al., 29 Dec 2025); see the sketch after this list.
- Attention-based Fusion: Substitute gating with cross-modal (multi-head) attention for token-wise or region-wise weighting, as in DiTFuse and transformer-based settings (Li et al., 8 Dec 2025).
- Parameter-efficient Adaptation: LoRA and similar low-rank adapters provide robust, lightweight handling of distribution shift and degradation profile diversity (Su et al., 28 Jun 2025).
- Hierarchical and Iterative Fusion: Stacked InterFusion blocks allow multiple stages of refinement and deeper cross-modal alignment, as demonstrated in both video and transformer realms (Martirosyan et al., 29 Dec 2025, Li et al., 8 Dec 2025).
- Conditional and Controlled Fusion: Augment fusion with textual instructions or task tags, enabling not only flexible merging of input modalities, but explicit user steerability and hierarchical semantic control (Li et al., 8 Dec 2025).
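As a concrete illustration of the first point above, a gate can be computed with a 1-D convolution over the token axis so that each fusion weight sees a local temporal neighborhood. The kernel size and sigmoid gate here are illustrative assumptions, not a specification from the cited work.

```python
import torch
import torch.nn as nn

class ConvGatedFusion(nn.Module):
    """Gated fusion whose gate is a 1-D conv over the token axis (local context)."""
    def __init__(self, dim: int, kernel_size: int = 5):
        super().__init__()
        self.gate = nn.Conv1d(2 * dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a, x_b: (batch, tokens, dim)
        cat = torch.cat([x_a, x_b], dim=-1).transpose(1, 2)  # (batch, 2*dim, tokens)
        g = torch.sigmoid(self.gate(cat)).transpose(1, 2)    # (batch, tokens, dim)
        return g * x_a + (1.0 - g) * x_b
```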
6. Comparative Analysis and Implications
InterFusion modules, in their various instantiations, deliver substantial improvements over traditional staged or static fusion schemes by enabling parameter-efficient, context-aware, and often symmetric cross-modal interactions. Their adoption marks a convergence towards architectures that are both robust to real-world data corruption and adaptable for instruction-driven or user-conditioned multimodal applications. A plausible implication is that future multimodal systems across scientific and generative domains will increasingly rely on explicit InterFusion-like designs as the backbone for unified, controlled, and high-fidelity information integration.