Missing-aware Mixture-of-Loras (MaMOL)

Updated 21 November 2025
  • The paper introduces a dual-routing mixture-of-experts strategy that reformulates missing modalities as a multi-task problem, enabling robust, unified inference.
  • It integrates lightweight LoRA modules within a frozen Transformer backbone to achieve substantial computational savings and reliable performance.
  • Empirical results on remote sensing datasets demonstrate MaMOL's superiority, maintaining high overall accuracy even at extreme missing rates.

Missing-aware Mixture-of-Loras (MaMOL) is a parameter-efficient multimodal learning framework developed to address modality-missing classification, particularly in remote sensing. It reformulates the handling of missing modalities as a multi-task learning problem and introduces a dual-routing mixture-of-experts strategy. MaMOL combines dynamic, pattern-aware expert routing with stable, modality-aware knowledge sharing via low-rank adaptation (LoRA) modules, enabling robust inference regardless of which modalities are present at training or test time. This architecture achieves substantial computational savings and improved generalization relative to prior methods that rely on fully fine-tuned or pattern-specific networks (Gao et al., 14 Nov 2025).

1. Multi-Task Formulation of Missing Modalities

MaMOL conceptualizes every combination of observed/missing modalities as a distinct classification task. With M modalities (e.g., optical, SAR, LiDAR, hyperspectral), each input is modeled as an M-tuple:

x = (x^{(1)}, \ldots, x^{(M)}),

where x^{(m)} is the actual observation if modality m is present, or a learnable dummy placeholder \tilde x^{(m)} if it is missing. The dataset is partitioned into subsets \{D^p\}_{p=1}^P, where each p indexes a unique binary presence/absence pattern. For example, with two modalities,

  • D^c = \{(x^{m_1}, x^{m_2}, y)\} (both present)
  • D^{m_1} = \{(x^{m_1}, \tilde x^{m_2}, y)\} (only m_1 present)
  • D^{m_2} = \{(\tilde x^{m_1}, x^{m_2}, y)\} (only m_2 present)

The target is a single model f_\theta that predicts correctly for any pattern p and any x \in D^p, eliminating the need for retraining or maintaining multiple pattern-specific models.
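A minimal sketch of this data formulation follows, assuming a PyTorch-style setup; the module name, feature shapes, and placeholder parametrization are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class MissingAwareInput(nn.Module):
    """Builds the M-tuple x = (x^(1), ..., x^(M)), substituting a learnable
    dummy placeholder x_tilde^(m) for each absent modality."""

    def __init__(self, modalities, feat_dim):
        super().__init__()
        self.modalities = list(modalities)  # e.g. ["optical", "sar"]
        # One learnable placeholder per modality, used whenever it is missing.
        self.placeholders = nn.ParameterDict({
            m: nn.Parameter(torch.zeros(feat_dim)) for m in self.modalities
        })

    def forward(self, sample):
        """sample: dict mapping modality name -> feature tensor (feat_dim,) or None if missing."""
        feats, presence = [], []
        for m in self.modalities:
            observed = sample.get(m) is not None
            presence.append(1.0 if observed else 0.0)
            feats.append(sample[m] if observed else self.placeholders[m])
        # `presence` is the binary pattern p selecting the subset D^p this sample belongs to.
        return torch.stack(feats), torch.tensor(presence)

# With two modalities the same module covers D^c, D^{m_1}, and D^{m_2}, e.g.:
# inp = MissingAwareInput(["optical", "sar"], feat_dim=768)
# feats, pattern = inp({"optical": torch.randn(768), "sar": None})  # pattern -> [1., 0.]
```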

2. Model Architecture and Expert Routing

MaMOL employs a frozen, pretrained ViT-style Transformer backbone (e.g., CLIP ViT-B/16) augmented with lightweight LoRA modules (termed "experts") in L selected Transformer blocks. Each expert modifies only the feed-forward component via low-rank updates. The architecture incorporates two residual pathways per layer:

h_{\text{layer}} = \mathrm{LayerNorm}(h_{\text{frozen}} + \Delta h_{\text{dyn}} + \Delta h_{\text{stat}})
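Both residual pathways are built from low-rank (LoRA) experts; a minimal PyTorch-style sketch of one such expert follows (the rank value and the zero initialization of B are assumptions, not taken from the paper).

```python
import torch.nn as nn

class LoRAExpert(nn.Module):
    """Low-rank expert f(z) = B A z with A in R^{r x d}, B in R^{d x r}, and r << d."""

    def __init__(self, d, r=4):
        super().__init__()
        self.A = nn.Linear(d, r, bias=False)  # down-projection A (weight shape r x d)
        self.B = nn.Linear(r, d, bias=False)  # up-projection B (weight shape d x r)
        nn.init.zeros_(self.B.weight)         # zero-init so the expert starts as a no-op

    def forward(self, z):
        return self.B(self.A(z))              # (..., d) -> (..., d), rank-r update
```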

2.1 Dynamic Router ("Task-Oriented")

A set of N low-rank pattern experts \{f_k\} is maintained, where f_k(z) = B_d^{(k)} A_d^{(k)} z with A_d^{(k)} \in \mathbb{R}^{r\times d}, B_d^{(k)} \in \mathbb{R}^{d\times r}, and r \ll d. The dynamic router R_\mathrm{dyn} receives the hidden feature z and a one-hot or learned encoding m_{\text{type}} \in \mathbb{R}^P of the missing pattern:

g_t = \mathrm{softmax}(W_t f_t([z; m_{\text{type}}]))

Sparse gating is enforced by activating only the top-K entries of g_t. The dynamic residual is

\Delta h_{\text{dyn}} = \sum_{k=1}^K g_t^{(k)} f_k(z)
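A compact sketch of this task-oriented routing, reusing the hypothetical LoRAExpert module above; evaluating every expert and masking with the sparse gates is a simplification for clarity, since an efficient implementation would run only the K selected experts.

```python
import torch
import torch.nn as nn

class DynamicRouter(nn.Module):
    """Task-oriented routing over N pattern experts with sparse top-K gating."""

    def __init__(self, d, num_patterns, num_experts, r=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(LoRAExpert(d, r) for _ in range(num_experts))
        self.f_t = nn.Linear(d + num_patterns, d)  # feature transform f_t over [z; m_type]
        self.W_t = nn.Linear(d, num_experts)       # gating projection W_t
        self.top_k = top_k

    def forward(self, z, m_type):
        # g_t = softmax(W_t f_t([z; m_type]))
        g_t = torch.softmax(self.W_t(self.f_t(torch.cat([z, m_type], dim=-1))), dim=-1)
        # Sparse gating: keep only the top-K entries of g_t, zero the rest.
        vals, idx = g_t.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(g_t).scatter_(-1, idx, vals)               # (B, N)
        # For clarity every expert is evaluated and masked by its gate.
        expert_out = torch.stack([f_k(z) for f_k in self.experts], dim=-1)  # (B, d, N)
        return (expert_out * gates.unsqueeze(1)).sum(dim=-1)                # Δh_dyn: (B, d)
```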

2.2 Static Router ("Modality-Specific-Shared")

K_s static experts \{E_s^{(j)}\} (with their own LoRA parameters) are activated using fixed coefficients s^{(j)}, assigned per present modality or globally:

  • s^{(j)} = 1 if expert j is shared or matches a present modality,
  • s^{(j)} = 0 otherwise.

The static residual is

\Delta h_{\text{stat}} = \sum_{j=1}^{K_s} s^{(j)} B_s^{(j)} A_s^{(j)} z
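The static pathway admits a similarly small sketch (again building on the hypothetical LoRAExpert above; assigning one expert per modality plus one globally shared expert is an assumed layout consistent with the description).

```python
import torch.nn as nn

class StaticRouter(nn.Module):
    """Modality-specific-shared experts combined with fixed 0/1 coefficients s^(j)."""

    def __init__(self, d, num_modalities, r=4):
        super().__init__()
        # One expert per modality plus one globally shared expert (assumed layout).
        self.modality_experts = nn.ModuleList(LoRAExpert(d, r) for _ in range(num_modalities))
        self.shared_expert = LoRAExpert(d, r)

    def forward(self, z, presence):
        """presence: 0/1 vector over modalities; assumes the batch shares one missing pattern."""
        delta = self.shared_expert(z)            # the shared expert always has s^(j) = 1
        for j, expert in enumerate(self.modality_experts):
            if presence[j] > 0:                  # s^(j) = 1 only if modality j is present
                delta = delta + expert(z)
        return delta                             # Δh_stat
```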

3. LoRA Expert Integration and Inference Workflow

All expert modifications rely on low-rank adaptation, such that for a feed-forward weight W_0 \in \mathbb{R}^{d\times d}, the update is

W_\text{new} = W_0 + \Delta W, \quad \Delta W = B A,

where A \in \mathbb{R}^{r\times d} and B \in \mathbb{R}^{d\times r}. Inference at each designated layer proceeds as follows (sketched in code after the list):

  1. Compute z = h_{\text{frozen}}^{(\ell)}
  2. Obtain the dynamic gating vector g_t and select the top-K experts
  3. Compute \Delta h_{\text{dyn}} and \Delta h_{\text{stat}}
  4. Form the output h_{\ell+1} via LayerNorm as above
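Combining the pieces, a per-layer forward pass could look as follows (a hypothetical wrapper around a frozen Transformer block, reusing the router sketches above; all module names and default values are illustrative).

```python
import torch.nn as nn

class MaMOLLayer(nn.Module):
    """Frozen Transformer block plus dynamic and static LoRA residual pathways."""

    def __init__(self, frozen_block, d, num_patterns, num_modalities,
                 num_dyn_experts=4, r=4, top_k=2):
        super().__init__()
        self.frozen_block = frozen_block
        for p in self.frozen_block.parameters():
            p.requires_grad_(False)                   # backbone weights stay frozen
        self.dynamic = DynamicRouter(d, num_patterns, num_dyn_experts, r, top_k)
        self.static = StaticRouter(d, num_modalities, r)
        self.norm = nn.LayerNorm(d)

    def forward(self, h, m_type, presence):
        z = self.frozen_block(h)                      # step 1: frozen features h_frozen
        delta_dyn = self.dynamic(z, m_type)           # steps 2-3: gated pattern experts
        delta_stat = self.static(z, presence)         #            plus modality/shared experts
        return self.norm(z + delta_dyn + delta_stat)  # step 4: LayerNorm(h_frozen + Δh_dyn + Δh_stat)
```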

4. Training Objective, Parameter Efficiency, and Optimization

The learning objective treats each missing-modality pattern as a task, minimizing aggregated cross-entropy loss:

L_\text{total}(\theta) = \sum_{p=1}^P \mathbb{E}_{(x,y)\in D^p}\left[\mathrm{CE}(f_\theta(x), y)\right]

Batches include a fraction \eta of incomplete-pattern samples for better generalization. Only the following parameters are updated: the expert matrices \{A_d^{(k)}, B_d^{(k)}, A_s^{(j)}, B_s^{(j)}\}, the router (W_t, f_t), and the classification head; the backbone weights \theta_\text{backbone} remain frozen. Optional \ell_2 regularization can be applied. The corresponding training pseudocode specifies per-sample pattern tracking, expert routing, and modular forward passes.
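A condensed training-step sketch under these assumptions; the dataloader interface, model signature, optimizer choice, and hyperparameters are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, pattern_batches):
    """One optimization step of the multi-task objective: summed cross-entropy over patterns.

    `pattern_batches` is assumed to be a list of (inputs, m_type, presence, labels)
    tuples, one per missing-pattern subset D^p sampled for this step."""
    optimizer.zero_grad()
    loss = 0.0
    for inputs, m_type, presence, labels in pattern_batches:
        logits = model(inputs, m_type, presence)       # routed forward pass for pattern p
        loss = loss + F.cross_entropy(logits, labels)  # CE(f_theta(x), y)
    loss.backward()                                    # gradients reach only LoRA experts,
    optimizer.step()                                   # routers, and the classification head
    return float(loss)

# Only trainable parameters enter the optimizer; weight decay stands in for optional l2 regularization.
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4, weight_decay=1e-4)
```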

MaMOL’s trainable parameter count scales sublinearly in the number of patterns:

\text{Total} \approx (\#\text{dyn experts} + \#\text{static experts}) \cdot 2dr \ll d^2 L

Sparse top-K gating keeps compute low, yielding parameter and FLOP savings compared to per-pattern architectures.
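As a rough worked example of this scaling (d = 768, r = 4, L = 6, and four experts per pathway per layer are assumed ViT-B-style values, not the paper's reported configuration):

```python
d, r, L = 768, 4, 6      # hidden width, LoRA rank, adapted layers (assumed ViT-B-style values)
dyn, stat = 4, 4         # dynamic / static experts per adapted layer (assumed)

lora_params = (dyn + stat) * 2 * d * r * L   # (#dyn + #static) * 2dr per adapted layer
dense_params = d * d * L                     # one fully updated d x d matrix per layer
print(lora_params, dense_params)             # 294912 vs 3538944 (~8%), dwarfed by ~86M backbone weights
```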

5. Empirical Performance and Benchmarks

MaMOL was evaluated on several multimodal remote sensing datasets (Houston2013: HS+LiDAR; Trento: HS+LiDAR; Augsburg: HS+SAR+LiDAR) under various missing rates \eta \in \{50\%, 70\%, 90\%\} and split configurations mixing fully complete and incomplete data. For example, on Houston2013 at 50% missing with a 100%/50% split, Overall Accuracy (OA, %):

  • MMP: 91.56/90.95
  • DCP: 97.60/97.43
  • MaMOL: 98.40/98.29

At an extreme 90% missing rate, MaMOL sustains OA above 98.5%. On Augsburg with 75% missing and three modalities, MaMOL achieved 80.10% OA, outperforming strong baselines by roughly 2%.

Transfer to natural images on MM-IMDb with a 3.6M-parameter MaMOL configuration yielded 55.54% F1-macro, compared to approximately 51.95% for the best baseline (DCP).

Ablation studies confirm that dynamic experts are key for adapting to pattern shifts, static experts stabilize learning, and modality-specialized experts encode fine-grained priors. The model generalizes to unseen missing patterns due to the multi-task routing mechanism.

6. Computational Efficiency and Scalability

Inserting experts into six Transformer layers (two dynamic and two static per layer) increases trainable parameters by less than 1% and inference FLOPs by less than 5% over adapter-only baselines. This modest overhead yields over a 1% absolute gain in OA. Compared to training P separate models (where the parameter cost is P times the full model size), MaMOL requires only a shared backbone plus small LoRA expert heads, realizing >95% parameter savings for large P.
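A back-of-the-envelope illustration of that savings claim, using an assumed ~86M-parameter ViT-B/16 backbone and ~0.3M of LoRA/router/head extras:

```python
backbone, extras = 86e6, 3e5   # assumed: ~ViT-B/16 weight count; LoRA experts + routers + head
for P in (2, 4, 8, 32):        # number of missing-modality patterns (assumed values)
    separate = P * backbone            # storing one fully fine-tuned model per pattern
    shared = backbone + extras         # one frozen backbone serving every pattern
    print(P, round(1 - shared / separate, 3))
# savings: 0.498 (P=2), 0.749 (P=4), 0.875 (P=8), 0.969 (P=32) -- approaching 1 - 1/P
```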

The static and dynamic routing infrastructure, combined with low-rank updates, enables both extensibility to M > 2 modalities and robust adaptation under practical missingness, while sparse gating preserves an efficient compute budget (Gao et al., 14 Nov 2025).
