
FuseMoE Framework for Fleximodal Fusion

Updated 19 August 2025
  • The paper introduces a Laplace-based gating mechanism that improves convergence and balances expert routing for irregular multimodal data.
  • FuseMoE employs modality-specific encoders with discretized multi-time attention to handle variable sampling and incomplete clinical records.
  • The framework robustly manages missing modalities through trainable indicators and flexible router designs, delivering state-of-the-art performance on ICU risk prediction tasks.

FuseMoE is a mixture-of-experts (MoE) fusion framework specifically designed to address the modeling challenges posed by fleximodal data—settings characterized by varied, incomplete, and irregularly sampled multimodal observations. FuseMoE's architecture combines specialized encoders for each modality, a robust discretized attention mechanism for handling sampling irregularity, and a novel Laplace-based gating mechanism in the MoE fusion layer. The framework achieves state-of-the-art performance on challenging clinical risk prediction tasks with complex electronic health record (EHR) data and demonstrates superior data efficiency and scalability relative to traditional multimodal fusion strategies.

1. Architectural Principles

FuseMoE is structured as a two-stage pipeline: modality-specific encoding followed by sparse MoE-based fusion. Each modality—such as continuous physiological time series, clinical text, images, or electrocardiograms—is processed by a dedicated encoder tailored to the data structure (e.g., sequence model for time series, transformer or CNN for text or imagery). The encoders regularize irregular or missing samples via discretization, mapping all observations onto a common temporal grid and generating fixed-length per-modality embeddings.

Formally, for input modalities $\mathcal{M}_1, \dots, \mathcal{M}_K$, each $x^{(k)}$ is encoded to $z^{(k)} = f^{(k)}_{\text{enc}}(x^{(k)})$. For modalities with observation times $\{\tau_j^{(k)}\}$, a multi-time attention encoder (mTAND) is employed:

$$\phi_h(\tau_j)[1] = w_1 \tau_j, \qquad \phi_h(\tau_j)[i] = \sin(w_i \tau_j + \phi_i), \quad 1 < i \leq d_h$$

These temporal features are aggregated into bins, and a linear projection maps to a shared embedding space of dimension $d_e$.
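As a rough illustration, the sinusoidal time embedding above can be sketched in NumPy. This is a minimal sketch, not the paper's implementation: the frequency and phase parameters would be learned in practice, and all shapes here are toy values.

```python
import numpy as np

def time_embedding(tau, w, phi):
    """mTAND-style time embedding (sketch).

    tau : observation times, shape (T,)
    w, phi : frequency / phase parameters, shape (d_h,) each
             (learnable in the real model; random here)
    The first channel is linear in time; the rest are sinusoidal.
    """
    tau = np.asarray(tau, dtype=float)[:, None]        # (T, 1)
    w, phi = np.asarray(w), np.asarray(phi)
    emb = np.sin(tau * w[None, :] + phi[None, :])      # (T, d_h)
    emb[:, 0] = w[0] * tau[:, 0]                       # linear first channel
    return emb

# toy example: 3 irregular observation times, 4-dim embedding
rng = np.random.default_rng(0)
E = time_embedding([0.0, 0.5, 1.3], w=rng.normal(size=4), phi=rng.normal(size=4))
```

Each row of `E` is the embedding of one observation time; in the full encoder these rows would be attended over and aggregated into fixed temporal bins.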

The MoE fusion layer receives these embeddings and employs a set of $N$ expert networks $E_1, \dots, E_N$. For each combined embedding $x$, the final output is:

$$y = \sum_{i=1}^{N} G(x)_i\, E_i(x)$$

where $G(x)_i$ is the routing weight assigned by the gating function. Auxiliary entropy regularization ensures balanced expert utilization.
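The weighted combination above can be sketched as follows. This is a dense toy version (all experts evaluated) with a softmax gate used for concreteness; the expert networks, gate weights, and dimensions are hypothetical stand-ins, and the entropy term is returned so it could feed an auxiliary balancing loss.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_forward(x, W_gate, experts):
    """Dense MoE fusion: y = sum_i G(x)_i * E_i(x) (sketch).

    x       : (d,) fused per-sample embedding
    W_gate  : (N, d) gating projection
    experts : list of N callables mapping (d,) -> output vector
    Returns the fused output, the gate weights, and their entropy
    (maximizing entropy encourages balanced expert utilization).
    """
    g = softmax(W_gate @ x)                        # routing weights G(x), (N,)
    y = sum(gi * E(x) for gi, E in zip(g, experts))
    entropy = -(g * np.log(g + 1e-12)).sum()
    return y, g, entropy

# toy: 3 linear experts on a 4-dim embedding, 2-dim output
rng = np.random.default_rng(1)
x = rng.normal(size=4)
W_gate = rng.normal(size=(3, 4))
experts = [lambda v, A=rng.normal(size=(2, 4)): A @ v for _ in range(3)]
y, g, ent = moe_forward(x, W_gate, experts)
```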

2. Laplace-Based Gating Mechanism

A distinctive feature of FuseMoE is the Laplace gating function, which replaces the conventional softmax gating. The standard gating applies:

$$G(x)_i = \frac{\exp(h_i(x))}{\sum_j \exp(h_j(x))}$$

where $h_i(x) = x^T W_i$ is a linear projection. FuseMoE's Laplace gate instead computes:

$$h_i(x) = -\| W_i - x \|_2, \qquad G(x)_i \propto \exp\!\left(-\| W_i - x \|_2\right)$$

Selection is performed by top-$K$ gating: the $K$ experts closest to $x$ under the $\ell_2$ norm are activated. The Laplace mechanism is less prone to degenerate routing (e.g., all inputs routed to a single expert) than the softmax, leading to more balanced expert usage. Theoretical analysis in the paper demonstrates improved convergence for parameter estimation (with rates $O(n^{-1/2})$ for the Laplace gate versus $O(n^{-1/4})$ for softmax in some regimes, and conditional density convergence $O(\sqrt{\log n / n})$), facilitating faster and more stable learning.
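A minimal sketch of the Laplace top-$K$ gate described above, treating each row of $W$ as an expert centroid; the dimensions and centroids here are toy values, not trained parameters.

```python
import numpy as np

def laplace_topk_gate(x, W, k):
    """Laplace gating (sketch): score each expert by the negative L2
    distance ||W_i - x||_2, keep only the k closest experts, and
    renormalize their exp(-distance) scores to sum to one."""
    d = np.linalg.norm(W - x, axis=1)   # (N,) distances to expert centroids
    topk = np.argsort(d)[:k]            # indices of the k nearest experts
    g = np.zeros(len(W))
    s = np.exp(-d[topk])
    g[topk] = s / s.sum()
    return g

# toy: 5 expert centroids in 3-d; x is placed near expert 0
rng = np.random.default_rng(2)
W = rng.normal(size=(5, 3))
x = W[0] + 0.01
g = laplace_topk_gate(x, W, k=2)
```

Because routing depends on distance rather than a learned inner product, an input near a centroid is always dominated by that expert, which is one intuition for why the gate resists collapse onto a single expert.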

3. Handling Missing Modalities and Irregularity

FuseMoE is specifically engineered for the challenges of fleximodal data:

  • Arbitrary Modalities: Each data source is encoded and aligned independently, making it possible to handle heterogeneous modality sets and even entirely missing modalities for some samples.
  • Irregular Sampling: Discretized mTAND modules aggregate sampled observations into fixed bins, handling both sparsity and irregular time intervals.
  • Missing Data: Missing modalities are denoted by a trainable "missing indicator" embedding, ensuring that their absence does not introduce bias in the fusion. Entropy regularization in the routing encourages the model to avoid routing missing-token embeddings to the same experts as present modalities.
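The missing-indicator idea in the last bullet can be sketched as follows. This is a toy illustration, not the paper's code: the encoders are arbitrary stand-ins, and the "trainable" missing embeddings are plain random vectors here rather than learned parameters.

```python
import numpy as np

d_e = 4  # shared embedding dimension (illustrative)
rng = np.random.default_rng(3)

# hypothetical per-modality encoders mapping raw input -> (d_e,) embedding
encoders = {"vitals": lambda x: np.tanh(x[:d_e]),
            "notes":  lambda x: np.tanh(x[:d_e])}

# one learned "missing" embedding per modality (random stand-ins here)
missing_emb = {m: rng.normal(size=d_e) for m in encoders}

def encode_with_missing(sample):
    """Substitute the missing-indicator embedding for absent modalities,
    so every sample yields a fixed-length fused representation."""
    parts = []
    for m, enc in encoders.items():
        x = sample.get(m)
        parts.append(missing_emb[m] if x is None else enc(x))
    return np.concatenate(parts)

# patient with vitals but no clinical notes
z = encode_with_missing({"vitals": np.ones(6), "notes": None})
```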

Multiple router architectures are covered: joint routing across all modalities, modality-specific routers with shared expert pools, and non-overlapping expert pools per modality. This flexibility allows practitioners to balance inter-modal and intra-modal interactions according to the data structure and target task.
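One of these designs, modality-specific routers over a shared expert pool, can be sketched as below; the router matrices, pool size, and embeddings are illustrative, and only the expert-selection step is shown.

```python
import numpy as np

def per_modality_topk(z, routers, k=2):
    """Modality-specific routers over a shared expert pool (sketch).

    z       : dict modality -> (d,) embedding
    routers : dict modality -> (N, d) router weights over N shared experts
    Each modality scores the shared pool with its own router and keeps
    its own top-k expert indices; experts may overlap across modalities.
    """
    chosen = {}
    for m, emb in z.items():
        scores = routers[m] @ emb             # (N,) per-expert scores
        chosen[m] = np.argsort(scores)[-k:]   # this modality's top-k experts
    return chosen

rng = np.random.default_rng(4)
z = {"vitals": rng.normal(size=4), "notes": rng.normal(size=4)}
routers = {m: rng.normal(size=(6, 4)) for m in z}  # pool of 6 shared experts
sel = per_modality_topk(z, routers, k=2)
```

Using disjoint slices of the pool per modality instead would give the non-overlapping variant, trading cross-modal expert sharing for stronger intra-modal specialization.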

4. Performance, Scalability, and Training Properties

Empirical studies on the MIMIC-III and MIMIC-IV datasets show FuseMoE achieving superior AUROC and F1 scores compared to HAIM, MISTS, MulT, TF, and MAG on ICU mortality, length-of-stay, and phenotype-classification tasks. Ablation studies indicate that performance increases as the number of experts grows, saturating beyond 16 experts (with top-4 gating).

The Laplace gating mechanism also empirically provides better predictive performance and more stable learning dynamics than softmax-based gating. Sparse MoE activation keeps the computational cost bounded, since only a small subset of the experts is evaluated per input token.

From a theoretical perspective, the convergence and parameter estimation rates reflect improved sample efficiency, and the choice of gating function can be directly connected to convergence theorems proved in the paper.

5. Real-World Applications

FuseMoE’s methodology targets domains where data is highly variable and often incomplete, especially in healthcare. The framework is validated on risk prediction tasks using complex EHRs containing:

  • Vital/laboratory time series with missing values and irregular timings,
  • Clinical notes (free-text),
  • Imaging data (e.g., chest X-rays),
  • ECG signals.

Critically, not all patients have all modalities or even the same observation frequencies. Prediction targets include 48-hour in-hospital mortality, ICU length of stay, and multi-phenotype classification (25 classes). The framework's flexible fusion, missingness-robust architecture, and theoretical efficiency deliver improved risk scores and classification accuracy.

6. Comparative and Methodological Context

FuseMoE extends beyond traditional concatenative, cross-modal attention, or tensor fusion methods in both scalability and flexibility. Table 1 in the paper highlights performance and convergence rate improvements over direct competitors:

| Framework | Missing Modalities | Scalability | Gate Function | Key Result |
| --- | --- | --- | --- | --- |
| FuseMoE | Yes | Arbitrarily large | Laplace | Strong AUROC, fast convergence |
| HAIM | No | Limited | n/a | Feature concatenation only |
| MISTS | No | Pairwise | n/a | Limited modality coverage |
| MulT / TF | No | Poor | n/a | Cross-attention is expensive |

Key design innovations such as the Laplace gate, entropy-regularized routing, and flexible router/expert pool configuration distinguish FuseMoE from prior multi-modal fusion architectures.

7. Implications and Future Extensions

The theoretical and empirical advances in gating, sparsity, and modality-handling introduced by FuseMoE suggest broad applicability in domains that confront incomplete and heterogeneous sensor data. The ability to control inter- and intra-modal fusion granularity, leverage missingness-aware encoding, and improve convergence rates may benefit climate science, finance, and other complex systems with similar data challenges.

A plausible implication is that further developments could extend Laplace-style gating to dynamic multi-modal environments beyond healthcare, and that the architectural modularity positions FuseMoE as a scalable backbone for future work on real-time multimodal fusion and decision support systems.
