Missing-aware Mixture-of-Loras (MaMOL)

Updated 21 November 2025
  • The paper introduces a dual-routing mixture-of-experts strategy that reformulates missing modalities as a multi-task problem, enabling robust, unified inference.
  • It integrates lightweight LoRA modules within a frozen Transformer backbone to achieve substantial computational savings and reliable performance.
  • Empirical results on remote sensing datasets demonstrate MaMOL's superiority, maintaining high overall accuracy even at extreme missing rates.

Missing-aware Mixture-of-Loras (MaMOL) is a parameter-efficient multimodal learning framework developed to address modality-missing classification, particularly in remote sensing. It reformulates the handling of missing modalities as a multi-task learning problem and introduces a dual-routing mixture-of-experts strategy. MaMOL combines dynamic, pattern-aware expert routing with stable, modality-aware knowledge sharing via low-rank adaptation (LoRA) modules, enabling robust inference regardless of which modalities are present at training or test time. This architecture achieves substantial computational savings and improved generalization relative to prior methods that rely on fully fine-tuned or pattern-specific networks (Gao et al., 14 Nov 2025).

1. Multi-Task Formulation of Missing Modalities

MaMOL conceptualizes every combination of observed/missing modalities as a distinct classification task. With M modalities (e.g., optical, SAR, LiDAR, hyperspectral), each input is modeled as an M-tuple:

x = (x^{(1)}, \ldots, x^{(M)}),

where x^{(m)} is the actual observation if modality m is present, or a learnable dummy placeholder \tilde x^{(m)} if it is missing. The dataset is partitioned into subsets \{D^p\}_{p=1}^P, where each p indexes a unique binary presence/absence pattern. For example, with two modalities,

  • D^c = \{(x^{m_1}, x^{m_2}, y)\} (both present)
  • D^{m_1} = \{(x^{m_1}, \tilde x^{m_2}, y)\} (only m_1 present)
  • D^{m_2} = \{(\tilde x^{m_1}, x^{m_2}, y)\} (only m_2 present)

The target is a single model f_\theta that predicts correctly for any pattern p and any x \in D^p, eliminating the need for retraining or maintaining multiple pattern-specific models.
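A minimal sketch of this data formulation follows, assuming a PyTorch-style setup; the module name, feature shapes, and placeholder parametrization are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class MissingAwareInput(nn.Module):
    """Builds the M-tuple x = (x^(1), ..., x^(M)), substituting a learnable
    dummy placeholder x_tilde^(m) for each absent modality."""

    def __init__(self, modalities, feat_dim):
        super().__init__()
        self.modalities = list(modalities)  # e.g. ["optical", "sar"]
        # One learnable placeholder per modality, used whenever it is missing.
        self.placeholders = nn.ParameterDict({
            m: nn.Parameter(torch.zeros(feat_dim)) for m in self.modalities
        })

    def forward(self, sample):
        """sample: dict mapping modality name -> feature tensor (feat_dim,) or None if missing."""
        feats, presence = [], []
        for m in self.modalities:
            observed = sample.get(m) is not None
            presence.append(1.0 if observed else 0.0)
            feats.append(sample[m] if observed else self.placeholders[m])
        # `presence` is the binary pattern p selecting the subset D^p this sample belongs to.
        return torch.stack(feats), torch.tensor(presence)

# With two modalities the same module covers D^c, D^{m_1}, and D^{m_2}, e.g.:
# inp = MissingAwareInput(["optical", "sar"], feat_dim=768)
# feats, pattern = inp({"optical": torch.randn(768), "sar": None})  # pattern -> [1., 0.]
```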

2. Model Architecture and Expert Routing

MaMOL employs a frozen, pretrained ViT-style Transformer backbone (e.g., CLIP ViT-B/16) augmented with lightweight LoRA modules (termed "experts") in L selected Transformer blocks. Each expert modifies only the feed-forward component via low-rank updates. The architecture incorporates two residual pathways per layer:

h_{\text{layer}} = \mathrm{LayerNorm}(h_{\text{frozen}} + \Delta h_{\text{dyn}} + \Delta h_{\text{stat}})
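Both residual pathways are built from low-rank (LoRA) experts; a minimal PyTorch-style sketch of one such expert follows (the rank value and the zero initialization of B are assumptions, not taken from the paper).

```python
import torch.nn as nn

class LoRAExpert(nn.Module):
    """Low-rank expert f(z) = B A z with A in R^{r x d}, B in R^{d x r}, and r << d."""

    def __init__(self, d, r=4):
        super().__init__()
        self.A = nn.Linear(d, r, bias=False)  # down-projection A (weight shape r x d)
        self.B = nn.Linear(r, d, bias=False)  # up-projection B (weight shape d x r)
        nn.init.zeros_(self.B.weight)         # zero-init so the expert starts as a no-op

    def forward(self, z):
        return self.B(self.A(z))              # (..., d) -> (..., d), rank-r update
```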

2.1 Dynamic Router ("Task-Oriented")

A set of N low-rank pattern experts \{f_k\} is maintained, where f_k(z) = B_d^{(k)} A_d^{(k)} z with A_d^{(k)} \in \mathbb{R}^{r\times d}, B_d^{(k)} \in \mathbb{R}^{d\times r}, and r \ll d. The dynamic router R_\mathrm{dyn} receives the hidden feature z and a one-hot or learned encoding m_{\text{type}} \in \mathbb{R}^P of the missing pattern:

g_t = \mathrm{softmax}(W_t f_t([z; m_{\text{type}}]))

Sparse gating is enforced by activating only the top-K entries of g_t. The dynamic residual is

\Delta h_{\text{dyn}} = \sum_{k=1}^K g_t^{(k)} f_k(z)
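A compact sketch of this task-oriented routing, reusing the hypothetical LoRAExpert module above; evaluating every expert and masking with the sparse gates is a simplification for clarity, since an efficient implementation would run only the K selected experts.

```python
import torch
import torch.nn as nn

class DynamicRouter(nn.Module):
    """Task-oriented routing over N pattern experts with sparse top-K gating."""

    def __init__(self, d, num_patterns, num_experts, r=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(LoRAExpert(d, r) for _ in range(num_experts))
        self.f_t = nn.Linear(d + num_patterns, d)  # feature transform f_t over [z; m_type]
        self.W_t = nn.Linear(d, num_experts)       # gating projection W_t
        self.top_k = top_k

    def forward(self, z, m_type):
        # g_t = softmax(W_t f_t([z; m_type]))
        g_t = torch.softmax(self.W_t(self.f_t(torch.cat([z, m_type], dim=-1))), dim=-1)
        # Sparse gating: keep only the top-K entries of g_t, zero the rest.
        vals, idx = g_t.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(g_t).scatter_(-1, idx, vals)               # (B, N)
        # For clarity every expert is evaluated and masked by its gate.
        expert_out = torch.stack([f_k(z) for f_k in self.experts], dim=-1)  # (B, d, N)
        return (expert_out * gates.unsqueeze(1)).sum(dim=-1)                # Δh_dyn: (B, d)
```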

2.2 Static Router ("Modality-Specific-Shared")

K_s static experts \{E_s^{(j)}\} (with their own LoRA parameters) are activated using fixed coefficients s^{(j)}, assigned per present modality or globally:

  • s^{(j)} = 1 if expert j is shared or matches a present modality,
  • s^{(j)} = 0 otherwise.

The static residual is

\Delta h_{\text{stat}} = \sum_{j=1}^{K_s} s^{(j)} B_s^{(j)} A_s^{(j)} z
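The static pathway admits a similarly small sketch (again building on the hypothetical LoRAExpert above; assigning one expert per modality plus one globally shared expert is an assumed layout consistent with the description).

```python
import torch.nn as nn

class StaticRouter(nn.Module):
    """Modality-specific-shared experts combined with fixed 0/1 coefficients s^(j)."""

    def __init__(self, d, num_modalities, r=4):
        super().__init__()
        # One expert per modality plus one globally shared expert (assumed layout).
        self.modality_experts = nn.ModuleList(LoRAExpert(d, r) for _ in range(num_modalities))
        self.shared_expert = LoRAExpert(d, r)

    def forward(self, z, presence):
        """presence: 0/1 vector over modalities; assumes the batch shares one missing pattern."""
        delta = self.shared_expert(z)            # the shared expert always has s^(j) = 1
        for j, expert in enumerate(self.modality_experts):
            if presence[j] > 0:                  # s^(j) = 1 only if modality j is present
                delta = delta + expert(z)
        return delta                             # Δh_stat
```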

3. LoRA Expert Integration and Inference Workflow

All expert modifications rely on low-rank adaptation, such that for a feed-forward weight W_0 \in \mathbb{R}^{d\times d}, the update is

W_\text{new} = W_0 + \Delta W, \quad \Delta W = B A,

where A \in \mathbb{R}^{r\times d} and B \in \mathbb{R}^{d\times r}. Inference at each designated layer proceeds as follows (sketched in code after the list):

  1. Compute z = h_{\text{frozen}}^{(\ell)}
  2. Obtain the dynamic gating vector g_t and select the top-K experts
  3. Compute \Delta h_{\text{dyn}} and \Delta h_{\text{stat}}
  4. Form the output h_{\ell+1} via LayerNorm as above
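Combining the pieces, a per-layer forward pass could look as follows (a hypothetical wrapper around a frozen Transformer block, reusing the router sketches above; all module names and default values are illustrative).

```python
import torch.nn as nn

class MaMOLLayer(nn.Module):
    """Frozen Transformer block plus dynamic and static LoRA residual pathways."""

    def __init__(self, frozen_block, d, num_patterns, num_modalities,
                 num_dyn_experts=4, r=4, top_k=2):
        super().__init__()
        self.frozen_block = frozen_block
        for p in self.frozen_block.parameters():
            p.requires_grad_(False)                   # backbone weights stay frozen
        self.dynamic = DynamicRouter(d, num_patterns, num_dyn_experts, r, top_k)
        self.static = StaticRouter(d, num_modalities, r)
        self.norm = nn.LayerNorm(d)

    def forward(self, h, m_type, presence):
        z = self.frozen_block(h)                      # step 1: frozen features h_frozen
        delta_dyn = self.dynamic(z, m_type)           # steps 2-3: gated pattern experts
        delta_stat = self.static(z, presence)         #            plus modality/shared experts
        return self.norm(z + delta_dyn + delta_stat)  # step 4: LayerNorm(h_frozen + Δh_dyn + Δh_stat)
```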

4. Training Objective, Parameter Efficiency, and Optimization

The learning objective treats each missing-modality pattern as a task, minimizing aggregated cross-entropy loss:

L_\text{total}(\theta) = \sum_{p=1}^P \mathbb{E}_{(x,y)\in D^p}\left[\mathrm{CE}(f_\theta(x), y)\right]

Batches include a fraction \eta of incomplete-pattern samples for better generalization. Only the following parameters are updated: the expert matrices \{A_d^{(k)}, B_d^{(k)}, A_s^{(j)}, B_s^{(j)}\}, the router (W_t, f_t), and the classification head; the backbone weights \theta_\text{backbone} remain frozen. Optional \ell_2 regularization can be applied. The corresponding training pseudocode specifies per-sample pattern tracking, expert routing, and modular forward passes.
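A condensed training-step sketch under these assumptions; the dataloader interface, model signature, optimizer choice, and hyperparameters are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, pattern_batches):
    """One optimization step of the multi-task objective: summed cross-entropy over patterns.

    `pattern_batches` is assumed to be a list of (inputs, m_type, presence, labels)
    tuples, one per missing-pattern subset D^p sampled for this step."""
    optimizer.zero_grad()
    loss = 0.0
    for inputs, m_type, presence, labels in pattern_batches:
        logits = model(inputs, m_type, presence)       # routed forward pass for pattern p
        loss = loss + F.cross_entropy(logits, labels)  # CE(f_theta(x), y)
    loss.backward()                                    # gradients reach only LoRA experts,
    optimizer.step()                                   # routers, and the classification head
    return float(loss)

# Only trainable parameters enter the optimizer; weight decay stands in for optional l2 regularization.
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4, weight_decay=1e-4)
```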

MaMOL’s trainable parameter count scales sublinearly in the number of patterns:

\text{Total} \approx (\#\text{dyn experts} + \#\text{static experts}) \cdot 2dr \ll d^2 L

Sparse top-K gating keeps compute low, yielding parameter and FLOP savings compared to per-pattern architectures.
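As a rough worked example of this scaling (d = 768, r = 4, L = 6, and four experts per pathway per layer are assumed ViT-B-style values, not the paper's reported configuration):

```python
d, r, L = 768, 4, 6      # hidden width, LoRA rank, adapted layers (assumed ViT-B-style values)
dyn, stat = 4, 4         # dynamic / static experts per adapted layer (assumed)

lora_params = (dyn + stat) * 2 * d * r * L   # (#dyn + #static) * 2dr per adapted layer
dense_params = d * d * L                     # one fully updated d x d matrix per layer
print(lora_params, dense_params)             # 294912 vs 3538944 (~8%), dwarfed by ~86M backbone weights
```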

5. Empirical Performance and Benchmarks

MaMOL was evaluated on several multimodal remote sensing datasets (Houston2013: HS+LiDAR; Trento: HS+LiDAR; Augsburg: HS+SAR+LiDAR) under various missing rates \eta \in \{50\%, 70\%, 90\%\} and split configurations mixing fully complete and incomplete data. For example, on Houston2013 at 50% missing with a 100%/50% split, Overall Accuracy (OA, %):

  • MMP: 91.56/90.95
  • DCP: 97.60/97.43
  • MaMOL: 98.40/98.29

At an extreme 90% missing rate, MaMOL sustains OA above 98.5%. On Augsburg with 75% missing and three modalities, MaMOL achieved 80.10% OA, outperforming strong baselines by roughly 2%.

Transfer to natural images on MM-IMDb with a 3.6M-parameter MaMOL configuration yielded 55.54% F1-macro, compared to approximately 51.95% for the best baseline (DCP).

Ablation studies confirm that dynamic experts are key for adapting to pattern shifts, static experts stabilize learning, and modality-specialized experts encode fine-grained priors. The model generalizes to unseen missing patterns due to the multi-task routing mechanism.

6. Computational Efficiency and Scalability

Inserting experts into six Transformer layers (two dynamic and two static per layer) increases trainable parameters by less than 1% and inference FLOPs by less than 5% over adapter-only baselines. This modest overhead yields over a 1% absolute gain in OA. Compared to training P separate models (where the parameter cost is P times the full model size), MaMOL requires only a shared backbone plus small LoRA expert heads, realizing >95% parameter savings for large P.
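A back-of-the-envelope illustration of that savings claim, using an assumed ~86M-parameter ViT-B/16 backbone and ~0.3M of LoRA/router/head extras:

```python
backbone, extras = 86e6, 3e5   # assumed: ~ViT-B/16 weight count; LoRA experts + routers + head
for P in (2, 4, 8, 32):        # number of missing-modality patterns (assumed values)
    separate = P * backbone            # storing one fully fine-tuned model per pattern
    shared = backbone + extras         # one frozen backbone serving every pattern
    print(P, round(1 - shared / separate, 3))
# savings: 0.498 (P=2), 0.749 (P=4), 0.875 (P=8), 0.969 (P=32) -- approaching 1 - 1/P
```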

The static and dynamic routing infrastructure, combined with low-rank updates, enables both extensibility to M > 2 modalities and robust adaptation under practical missingness, while sparse gating preserves an efficient compute budget (Gao et al., 14 Nov 2025).
