FuseMoE Framework for Fleximodal Fusion
- The paper introduces a Laplace-based gating mechanism that improves convergence and balances expert routing for irregular multimodal data.
- FuseMoE employs modality-specific encoders with discretized multi-time attention to handle variable sampling and incomplete clinical records.
- The framework robustly manages missing modalities through trainable indicators and flexible router designs, delivering state-of-the-art performance on ICU risk prediction tasks.
FuseMoE is a mixture-of-experts (MoE) fusion framework specifically designed to address the modeling challenges posed by fleximodal data—settings characterized by varied, incomplete, and irregularly sampled multimodal observations. FuseMoE's architecture combines specialized encoders for each modality, a robust discretized attention mechanism for handling sampling irregularity, and a novel Laplace-based gating mechanism in the MoE fusion layer. The framework achieves state-of-the-art performance on challenging clinical risk prediction tasks with complex electronic health record (EHR) data and demonstrates superior data efficiency and scalability relative to traditional multimodal fusion strategies.
1. Architectural Principles
FuseMoE is structured as a two-stage pipeline: modality-specific encoding followed by sparse MoE-based fusion. Each modality—such as continuous physiological time series, clinical text, images, or electrocardiograms—is processed by a dedicated encoder tailored to the data structure (e.g., sequence model for time series, transformer or CNN for text or imagery). The encoders regularize irregular or missing samples via discretization, mapping all observations onto a common temporal grid and generating fixed-length per-modality embeddings.
Formally, for input modalities $X_1, \dots, X_M$, each $X_m$ is encoded to an embedding $E_m$. For modalities with irregular observation times $t_{m,1}, \dots, t_{m,L_m}$, a multi-time attention encoder (mTAND) is employed:

$$E_m(r) = \sum_{j=1}^{L_m} \alpha(r, t_{m,j})\, x_{m,j}, \qquad \alpha(r, t_{m,j}) = \frac{\exp\!\big(\phi(r)^\top \phi(t_{m,j})\big)}{\sum_{j'=1}^{L_m} \exp\!\big(\phi(r)^\top \phi(t_{m,j'})\big)},$$

where $r$ ranges over a set of learnable reference time points and $\phi(\cdot)$ is a learnable time embedding. These temporal features are aggregated into $T$ fixed bins, and a linear projection maps each modality into a shared embedding space of dimension $d$.
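To make the encoding step concrete, the sketch below (PyTorch) shows one way such a discretized multi-time attention module could be implemented: irregular observations are attended onto a fixed grid of reference times and then projected into the shared embedding space. The class names, the learnable reference grid, and the hyperparameters are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimeEmbedding(nn.Module):
    """Maps scalar timestamps to a learned embedding (one linear plus periodic terms)."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(1, 1)
        self.periodic = nn.Linear(1, dim - 1)

    def forward(self, t):                       # t: (batch, seq_len)
        t = t.unsqueeze(-1)                     # (batch, seq_len, 1)
        return torch.cat([self.linear(t), torch.sin(self.periodic(t))], dim=-1)


class DiscretizedTimeAttention(nn.Module):
    """Attends irregular observations onto a fixed grid of reference times (mTAND-style)."""

    def __init__(self, feat_dim: int, time_dim: int, n_bins: int, d_model: int):
        super().__init__()
        self.time_emb = TimeEmbedding(time_dim)
        self.ref_times = nn.Parameter(torch.linspace(0.0, 1.0, n_bins))  # learnable bin centers
        self.out_proj = nn.Linear(feat_dim, d_model)

    def forward(self, x, t, mask):
        # x: (B, L, feat_dim) observed values, t: (B, L) timestamps scaled to [0, 1],
        # mask: (B, L) with 1 for observed entries, 0 for padding (assumes >= 1 observation).
        q = self.time_emb(self.ref_times.expand(x.size(0), -1))   # (B, n_bins, time_dim)
        k = self.time_emb(t)                                      # (B, L, time_dim)
        scores = q @ k.transpose(1, 2) / k.size(-1) ** 0.5        # (B, n_bins, L)
        scores = scores.masked_fill(mask.unsqueeze(1) == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)                          # attention over observed times
        binned = attn @ x                                         # (B, n_bins, feat_dim)
        return self.out_proj(binned)                              # (B, n_bins, d_model) per-bin embeddings
```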
The MoE fusion layer receives these embeddings and employs a set of expert networks $\{f_1, \dots, f_E\}$. For each combined embedding $\mathbf{z}$, the final output is:

$$\mathbf{y} = \sum_{i=1}^{E} g_i(\mathbf{z})\, f_i(\mathbf{z}),$$

where $g_i(\mathbf{z})$ is the routing weight assigned by the gating function. Auxiliary entropy regularization ensures balanced expert utilization.
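As a minimal sketch of this fusion step, the module below computes the gate-weighted sum of expert outputs and an entropy-style auxiliary term that discourages collapsing onto a single expert. It assumes a dense softmax gate for brevity (a Laplace variant is sketched in Section 2), and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFusionLayer(nn.Module):
    """Gate-weighted sum of expert outputs with a simple load-balancing loss."""

    def __init__(self, d_model: int, n_experts: int, d_hidden: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
             for _ in range(n_experts)]
        )
        self.gate = nn.Linear(d_model, n_experts)  # dense softmax gate for brevity

    def forward(self, z):                                # z: (batch, d_model)
        weights = F.softmax(self.gate(z), dim=-1)        # (batch, n_experts)
        expert_out = torch.stack([f(z) for f in self.experts], dim=1)  # (batch, E, d_model)
        y = (weights.unsqueeze(-1) * expert_out).sum(dim=1)            # gate-weighted sum
        # Entropy-style balance term (one common formulation): minimizing the negative
        # entropy of the mean routing distribution spreads load across experts.
        mean_route = weights.mean(dim=0)
        balance_loss = (mean_route * torch.log(mean_route + 1e-9)).sum()
        return y, balance_loss
```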
2. Laplace-Based Gating Mechanism
A distinctive feature of FuseMoE is the Laplace gating function, which replaces conventional softmax gating. The standard softmax gate applies:

$$g_i(\mathbf{z}) = \frac{\exp(\mathbf{w}_i^\top \mathbf{z})}{\sum_{j=1}^{E} \exp(\mathbf{w}_j^\top \mathbf{z})},$$

where $\mathbf{w}_i$ is the $i$-th row of a linear projection $W_g$. FuseMoE's Laplace gate instead computes:

$$g_i(\mathbf{z}) = \frac{\exp\!\big(-\lVert \mathbf{z} - \mathbf{w}_i \rVert_2\big)}{\sum_{j=1}^{E} \exp\!\big(-\lVert \mathbf{z} - \mathbf{w}_j \rVert_2\big)}.$$
Selection is performed by top-$k$ gating: the $k$ experts whose gating vectors are closest to $\mathbf{z}$ under the $\ell_2$ norm are activated. The Laplace mechanism is less prone to degenerate routing (e.g., all inputs routed to a single expert) than the softmax, leading to more balanced expert usage. Theoretical analysis in the paper demonstrates improved convergence for parameter estimation (near-parametric rates for the Laplace gate versus substantially slower rates for softmax in some regimes, together with guarantees on conditional density convergence), facilitating more rapid and stable learning.
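A minimal sketch of such a distance-based gate with top-$k$ selection is given below, assuming the gating score for expert $i$ is the negative $\ell_2$ distance between the token embedding and a per-expert gating vector; the renormalization over the selected experts and all variable names are assumptions for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LaplaceGate(nn.Module):
    """Top-k gate whose scores are negative L2 distances to per-expert vectors."""

    def __init__(self, d_model: int, n_experts: int, top_k: int = 4):
        super().__init__()
        self.expert_keys = nn.Parameter(torch.randn(n_experts, d_model))
        self.top_k = top_k

    def forward(self, z):                           # z: (batch, d_model)
        dist = torch.cdist(z, self.expert_keys)     # (batch, n_experts) L2 distances
        scores = -dist                              # softmax of -dist gives weights ~ exp(-||z - w_i||_2)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)     # renormalize over the k nearest experts
        return weights, top_idx                     # evaluate only the experts in top_idx
```

Because the score is a monotone function of distance, the top-$k$ step activates exactly the experts whose gating vectors are nearest to the input, matching the description above.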
3. Handling Missing Modalities and Irregularity
FuseMoE is specifically engineered for the challenges of fleximodal data:
- Arbitrary Modalities: Each data source is encoded and aligned independently, making it possible to handle heterogeneous modality sets and even entirely missing modalities for some samples.
- Irregular Sampling: Discretized mTAND modules aggregate sampled observations into fixed bins, handling both sparsity and irregular time intervals.
- Missing Data: Missing modalities are denoted by a trainable "missing indicator" embedding, ensuring that their absence does not introduce bias in the fusion. Entropy regularization in the routing encourages the model to avoid routing missing-token embeddings to the same experts as present modalities.
Multiple router architectures are covered: joint routing across all modalities, modality-specific routers with shared expert pools, and non-overlapping expert pools per modality. This flexibility allows practitioners to balance inter-modal and intra-modal interactions according to the data structure and target task.
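The sketch below illustrates two of these mechanisms together, a trainable missing-indicator embedding substituted when a modality is absent and modality-specific routers over a shared expert pool; the names and the simplified batch-level treatment of missingness are assumptions for exposition.

```python
import torch
import torch.nn as nn


class FleximodalRouter(nn.Module):
    """Per-modality routing over a shared expert pool with missing-indicator embeddings."""

    def __init__(self, modalities, d_model: int, n_experts: int):
        super().__init__()
        # One trainable placeholder embedding per modality, used when that modality is absent.
        self.missing_tokens = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(d_model)) for m in modalities}
        )
        # Modality-specific routers that all point into the same shared expert pool.
        self.routers = nn.ModuleDict({m: nn.Linear(d_model, n_experts) for m in modalities})

    def forward(self, embeddings: dict):
        # embeddings: {modality_name: (batch, d_model) tensor, or None if the modality is missing}
        routed = {}
        for name, router in self.routers.items():
            z = embeddings.get(name)
            if z is None:                                    # substitute the learned missing indicator
                z = self.missing_tokens[name].unsqueeze(0)   # (1, d_model); per-sample handling omitted
            routed[name] = torch.softmax(router(z), dim=-1)  # per-modality routing weights
        return routed
```

A joint router or non-overlapping expert pools can be obtained from the same skeleton by replacing the per-modality routers with a single shared one, or by partitioning the expert indices each router may select.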
4. Performance, Scalability, and Training Properties
Empirical studies using MIMIC-III and MIMIC-IV datasets show FuseMoE achieving superior AUROC and F1 scores compared to HAIM, MISTS, MulT, TF, and MAG on ICU tasks of mortality, length of stay, and phenotype classification. Ablation studies indicate performance increases as the number of experts grows, ultimately saturating beyond 16 experts (with top-4 gating).
The Laplace gating mechanism also empirically provides better predictive performance and more stable learning dynamics than softmax-based gating. Sparse MoE activation keeps the computational cost bounded, since only a small subset of the experts is evaluated per input token.
From a theoretical perspective, the convergence and parameter estimation rates reflect improved sample efficiency, and the choice of gating function can be directly connected to convergence theorems proved in the paper.
5. Real-World Applications
FuseMoE’s methodology targets domains where data is highly variable and often incomplete, especially in healthcare. The framework is validated on risk prediction tasks using complex EHRs containing:
- Vital/laboratory time series with missing values and irregular timings,
- Clinical notes (free-text),
- Imaging data (e.g., chest X-rays),
- ECG signals.
Critically, not all patients have all modalities or even the same observation frequencies. Prediction targets include 48-hour in-hospital mortality, ICU length of stay, and multi-phenotype classification (25 classes). The framework's flexible fusion, missingness-robust architecture, and theoretical efficiency deliver improved risk scores and classification accuracy.
6. Comparative and Methodological Context
FuseMoE extends beyond traditional concatenative, cross-modal attention, or tensor fusion methods in both scalability and flexibility. The paper's comparisons (e.g., its Table 1) highlight performance and convergence-rate improvements over direct competitors, summarized below:
| Framework | Missing Modalities | Scalability | Gate Function | Key Result |
|---|---|---|---|---|
| FuseMoE | Yes | Arbitrarily large | Laplace | Strong AUROC, fast convergence |
| HAIM | No | Limited | — | Feature concatenation only |
| MISTS | No | Pairwise | — | Limited modality set |
| MulT/TF | No | Poor | — | Cross-attention is expensive |
Key design innovations such as the Laplace gate, entropy-regularized routing, and flexible router/expert pool configuration distinguish FuseMoE from prior multi-modal fusion architectures.
7. Implications and Future Extensions
The theoretical and empirical advances in gating, sparsity, and modality-handling introduced by FuseMoE suggest broad applicability in domains that confront incomplete and heterogeneous sensor data. The ability to control inter- and intra-modal fusion granularity, leverage missingness-aware encoding, and improve convergence rates may benefit climate science, finance, and other complex systems with similar data challenges.
A plausible implication is that further developments could extend Laplace-style gating to dynamic multi-modal environments beyond healthcare, and that the architectural modularity positions FuseMoE as a scalable backbone for future work on real-time multimodal fusion and decision support systems.