Missing-aware Mixture-of-Loras (MaMOL)
- The paper introduces a dual-routing mixture-of-experts strategy that reformulates missing modalities as a multi-task problem, enabling robust, unified inference.
- It integrates lightweight LoRA modules within a frozen Transformer backbone to achieve substantial computational savings and reliable performance.
- Empirical results on remote sensing datasets demonstrate MaMOL's superiority, maintaining high overall accuracy even at extreme missing rates.
Missing-aware Mixture-of-Loras (MaMOL) is a parameter-efficient multimodal learning framework developed to address the challenge of modality-missing classification, particularly within remote sensing. It reformulates the presence of missing modalities as a multi-task learning problem and introduces a dual-routing mixture-of-experts strategy. MaMOL uniquely combines dynamic, pattern-aware expert routing with stable, modality-aware knowledge sharing using low-rank adaptation (LoRA) modules, enabling robust inference regardless of which modalities are present at train or test time. This architecture achieves substantial computational savings and improved generalization relative to prior methods reliant on fully fine-tuned or pattern-specific networks (Gao et al., 14 Nov 2025).
1. Multi-Task Formulation of Missing Modalities
MaMOL conceptualizes every combination of observed/missing modalities as a distinct classification task. With $M$ modalities (e.g., optical, SAR, LiDAR, hyperspectral), each input is modeled as an $M$-tuple
$$x_i = \big(x_i^{1}, x_i^{2}, \dots, x_i^{M}\big),$$
where $x_i^{m}$ is the actual observation if modality $m$ is present, or a learnable dummy placeholder if it is missing. The dataset is partitioned into subsets $\mathcal{D} = \bigcup_{p} \mathcal{D}_p$, where each index $p \in \{0,1\}^{M}$ encodes a unique binary presence/absence pattern. For example, with two modalities,
- $p = (1,1)$: both present
- $p = (1,0)$: only $x^{1}$ present
- $p = (0,1)$: only $x^{2}$ present
The target is a single model $f_\theta$ that predicts correctly for any pattern $p$ and input $x_i$, eliminating the need for retraining or maintaining multiple models.
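To make the formulation concrete, the following minimal sketch (hypothetical code, not from the paper; `PatternedInput` and its names are illustrative) enumerates the non-empty presence patterns for $M$ modalities and substitutes a learnable dummy embedding whenever a modality is absent.

```python
from itertools import product

import torch
import torch.nn as nn


class PatternedInput(nn.Module):
    """Collects M modality features, replacing missing ones with learnable dummies."""

    def __init__(self, num_modalities: int, embed_dim: int):
        super().__init__()
        self.num_modalities = num_modalities
        # One learnable placeholder embedding per modality, used when that modality is absent.
        self.dummy = nn.Parameter(torch.zeros(num_modalities, embed_dim))

    @staticmethod
    def all_patterns(num_modalities: int):
        """All non-empty binary presence/absence patterns over the modalities."""
        return [p for p in product((0, 1), repeat=num_modalities) if any(p)]

    def forward(self, feats: list, pattern: tuple) -> torch.Tensor:
        # feats[m] is a (batch, embed_dim) feature for modality m, or None if missing.
        batch = next(f.size(0) for f in feats if f is not None)
        filled = [
            f if present else self.dummy[m].expand(batch, -1)
            for m, (f, present) in enumerate(zip(feats, pattern))
        ]
        return torch.stack(filled, dim=1)  # (batch, M, embed_dim)


if __name__ == "__main__":
    inp = PatternedInput(num_modalities=2, embed_dim=8)
    print(PatternedInput.all_patterns(2))   # [(0, 1), (1, 0), (1, 1)]
    x_opt = torch.randn(4, 8)               # modality 1 present, modality 2 missing
    out = inp([x_opt, None], pattern=(1, 0))
    print(out.shape)                        # torch.Size([4, 2, 8])
```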
2. Model Architecture and Expert Routing
MaMOL employs a frozen, pretrained ViT-style Transformer backbone (e.g., CLIP ViT-B/16) augmented with lightweight LoRA modules (termed "experts") in selected Transformer blocks. Each expert modifies only the feed-forward component via low-rank updates. The architecture incorporates two residual pathways per layer:
2.1 Dynamic Router ("Task-Oriented")
A set of $N_d$ low-rank pattern experts $\{E_1, \dots, E_{N_d}\}$, each a LoRA module, is maintained. The dynamic router receives the hidden feature $h$ and a one-hot or learned encoding $e_p$ of the missing-modality pattern, producing gating weights
$$g = \mathcal{R}_{\mathrm{dyn}}(h, e_p) \in \mathbb{R}^{N_d}.$$
Sparse gating is enforced by activating only the top-$k$ entries of $g$. The dynamic residual is
$$\Delta h_{\mathrm{dyn}} = \sum_{j \in \mathrm{TopK}(g,\,k)} g_j\, E_j(h).$$
2.2 Static Router ("Modality-Specific-Shared")
$N_s$ static experts $\{S_1, \dots, S_{N_s}\}$ (with their own LoRA parameters) are activated using fixed coefficients $\alpha_j$, set per present modality or globally:
- $\alpha_j = 1$ if expert $S_j$ is shared or matches a present modality,
- $\alpha_j = 0$ otherwise.
The static residual is
$$\Delta h_{\mathrm{stat}} = \sum_{j=1}^{N_s} \alpha_j\, S_j(h).$$
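Both routing paths can be sketched compactly in PyTorch. This is a simplified reading of the mechanism, assuming a linear-plus-softmax dynamic router and 0/1 static coefficients; the paper's exact parameterization may differ, and all class and variable names below are illustrative.

```python
import torch
import torch.nn as nn


class LoRAExpert(nn.Module):
    """A low-rank residual expert: x -> (x A^T) B^T with rank r << d."""

    def __init__(self, d: int, r: int = 4):
        super().__init__()
        self.A = nn.Linear(d, r, bias=False)   # down-projection
        self.B = nn.Linear(r, d, bias=False)   # up-projection
        nn.init.zeros_(self.B.weight)          # residual starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.B(self.A(x))


class DualRouter(nn.Module):
    """Dynamic (pattern-aware, top-k gated) plus static (fixed-coefficient) LoRA residuals."""

    def __init__(self, d: int, num_dyn: int, num_stat: int, num_patterns: int, k: int = 2):
        super().__init__()
        self.k = k
        self.dyn_experts = nn.ModuleList([LoRAExpert(d) for _ in range(num_dyn)])
        self.stat_experts = nn.ModuleList([LoRAExpert(d) for _ in range(num_stat)])
        self.router = nn.Linear(d + num_patterns, num_dyn)  # sees features + pattern encoding

    def forward(self, h, pattern_onehot, static_coeffs):
        # h: (batch, tokens, d); pattern_onehot: (batch, num_patterns); static_coeffs: (num_stat,)
        pooled = h.mean(dim=1)                               # per-sample summary feature
        gate = torch.softmax(self.router(torch.cat([pooled, pattern_onehot], dim=-1)), dim=-1)
        topk_val, topk_idx = gate.topk(self.k, dim=-1)       # sparse top-k gating

        dyn = torch.zeros_like(h)
        for j, expert in enumerate(self.dyn_experts):
            # gate weight of expert j for each sample (zero if j is not in its top-k set)
            w = (topk_val * (topk_idx == j)).sum(dim=-1)
            if torch.any(w > 0):
                dyn = dyn + w.view(-1, 1, 1) * expert(h)

        stat = sum(c * expert(h) for c, expert in zip(static_coeffs, self.stat_experts))
        return dyn, stat


if __name__ == "__main__":
    router = DualRouter(d=16, num_dyn=4, num_stat=3, num_patterns=3)
    h = torch.randn(2, 5, 16)
    onehot = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    coeffs = torch.tensor([1.0, 1.0, 0.0])   # shared expert + one present-modality expert active
    d_res, s_res = router(h, onehot, coeffs)
    print(d_res.shape, s_res.shape)          # torch.Size([2, 5, 16]) twice
```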
3. LoRA Expert Integration and Inference Workflow
All expert modifications rely on low-rank adaptation, such that for a feed-forward weight $W \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}}$, the update is
$$W' = W + \Delta W, \qquad \Delta W = BA,$$
where $B \in \mathbb{R}^{d_{\mathrm{out}} \times r}$, $A \in \mathbb{R}^{r \times d_{\mathrm{in}}}$, and $r \ll \min(d_{\mathrm{in}}, d_{\mathrm{out}})$. Inference at each designated layer proceeds as:
- Compute the frozen feed-forward output $\mathrm{FFN}(h)$
- Obtain the dynamic gating vector $g = \mathcal{R}_{\mathrm{dyn}}(h, e_p)$; select the top-$k$ experts
- Compute $\Delta h_{\mathrm{dyn}}$ and $\Delta h_{\mathrm{stat}}$
- Form the layer output via LayerNorm, $h' = \mathrm{LN}\big(h + \mathrm{FFN}(h) + \Delta h_{\mathrm{dyn}} + \Delta h_{\mathrm{stat}}\big)$
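As a quick numeric check of the low-rank update (illustrative dimensions and rank, not the paper's exact settings), applying $W + BA$ directly is equivalent to adding the factored residual $(xA^{\top})B^{\top}$, at a fraction of the trainable-parameter cost:

```python
import torch

d, r = 768, 8                       # hidden width and LoRA rank (illustrative values)
W = torch.randn(d, d)               # frozen feed-forward weight
B = torch.randn(d, r) * 0.01        # trainable up-projection
A = torch.randn(r, d) * 0.01        # trainable down-projection

delta_W = B @ A                     # rank-r update
x = torch.randn(4, d)

# Applying the adapted weight directly vs. adding the factored low-rank residual:
y_full = x @ (W + delta_W).T
y_lora = x @ W.T + (x @ A.T) @ B.T
print(torch.allclose(y_full, y_lora, atol=1e-3))   # True, up to float32 rounding

# Trainable cost: 2*d*r values per expert vs. d*d for full fine-tuning of this weight.
print(2 * d * r, d * d)             # 12288 vs 589824
```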
4. Training Objective, Parameter Efficiency, and Optimization
The learning objective treats each missing-modality pattern as a task, minimizing the aggregated cross-entropy loss
$$\mathcal{L} = \sum_{p} \sum_{(x_i, y_i) \in \mathcal{D}_p} \mathrm{CE}\big(f_\theta(x_i, p),\, y_i\big).$$
Batches include a fraction of incomplete-pattern samples for better generalization. Only the following parameters are updated: the expert matrices ($A_j$, $B_j$), the router weights, and the classification head; backbone weights remain frozen. Optional regularization can be applied. The corresponding training pseudocode specifies per-sample pattern tracking, expert routing, and modular forward passes.
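A skeletal version of this recipe, with simplified stand-in modules rather than the actual MaMOL components, looks as follows; it only illustrates the frozen-backbone/trainable-experts split and pattern-aware batching.

```python
import random

import torch
import torch.nn as nn

# Stand-in components (illustrative; the real model is a frozen CLIP ViT with LoRA experts).
backbone = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 64))
experts_and_router = nn.Linear(64 + 3, 64)   # placeholder for LoRA experts + pattern router
head = nn.Linear(64, 10)

for p in backbone.parameters():              # backbone stays frozen
    p.requires_grad_(False)

trainable = list(experts_and_router.parameters()) + list(head.parameters())
optim = torch.optim.AdamW(trainable, lr=1e-3)
criterion = nn.CrossEntropyLoss()

patterns = [(1, 1), (1, 0), (0, 1)]          # complete + incomplete patterns for M = 2

for _ in range(3):
    # Sample a (possibly incomplete) pattern for this batch; the paper mixes
    # complete and incomplete samples during training for better generalization.
    pattern = random.choice(patterns)
    onehot = torch.tensor([float(i == patterns.index(pattern)) for i in range(len(patterns))])
    x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))

    feats = backbone(x)                                              # frozen forward
    feats = experts_and_router(torch.cat([feats, onehot.expand(8, -1)], dim=-1))
    loss = criterion(head(feats), y)                                 # per-pattern cross-entropy

    optim.zero_grad()
    loss.backward()
    optim.step()
```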
MaMOL’s trainable parameter count scales sublinearly in the number of missing-modality patterns: a fixed pool of $N_d + N_s$ low-rank experts, each costing roughly $2dr$ parameters per adapted weight, is shared across all patterns rather than replicated per pattern. Sparse top-$k$ gating keeps compute low, yielding parameter and FLOP savings compared to per-pattern architectures.
5. Empirical Performance and Benchmarks
MaMOL was evaluated on several multimodal remote sensing datasets (Houston2013: HS+LiDAR; Trento: HS+LiDAR; Augsburg: HS+SAR+LiDAR) under varying missing rates and split configurations mixing fully complete and incomplete data. For example, on Houston2013 under one reported missing-rate setting, Overall Accuracy (OA%, two split configurations per method):
- MMP: 91.56/90.95
- DCP: 97.60/97.43
- MaMOL: 98.40/98.29
At extreme missing rates, MaMOL sustains high OA. On Augsburg, which combines three modalities, MaMOL achieved the best OA under heavy missingness, outperforming strong baselines by a clear margin.
Transfer to natural images on MM-IMDb with the 3.6M-trainable-parameter MaMOL variant yielded a higher F1-macro than the best baseline (DCP).
Ablation studies confirm dynamic experts are key for adaptation to pattern shifts, static experts stabilize learning, and modality-specialized experts encode fine-grained priors. Models generalize to unseen patterns due to the multi-task routing mechanism.
6. Computational Efficiency and Scalability
Inserting experts into six Transformer layers (two dynamic and two static per layer) adds only a small fraction of trainable parameters and inference FLOPs over adapter-only baselines, and this modest overhead translates into a clear absolute gain in OA. Compared to training a separate model per missing-modality pattern (where parameter cost scales as the number of patterns times the full model size), MaMOL requires only a shared frozen backbone plus small LoRA expert heads, realizing large parameter savings as the number of modalities grows.
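A back-of-the-envelope comparison, using assumed figures (ViT-B/16 at roughly 86M backbone parameters, rank-8 experts in six layers; not numbers reported in the paper), illustrates the scaling argument:

```python
M = 3                                   # modalities
patterns = 2 ** M - 1                   # non-empty presence patterns
backbone = 86_000_000                   # approximate ViT-B/16 parameter count

per_pattern_models = patterns * backbone            # one fully fine-tuned model per pattern

d, r, layers, experts_per_layer = 768, 8, 6, 4      # 2 dynamic + 2 static experts per layer
lora_params = layers * experts_per_layer * 2 * d * r
mamol_trainable = lora_params                        # plus a small router and head (omitted)

print(f"per-pattern fine-tuning: {per_pattern_models / 1e6:.0f}M trainable parameters")
print(f"MaMOL (shared frozen backbone): ~{mamol_trainable / 1e6:.2f}M trainable parameters")
```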
The static and dynamic routing infrastructure, combined with low-rank updates, enables both extensibility to additional modalities and robust adaptation under practical missing-modality conditions, while sparse gating preserves an efficient compute budget (Gao et al., 14 Nov 2025).