
SMIL: Multimodal Learning under Missing Modalities

Updated 2 March 2026
  • The paper introduces a Bayesian meta-learning framework that reconstructs missing modalities using latent-space imputation and uncertainty-guided regularization.
  • It employs a meta-learned approach combining reconstruction networks and modified MAML to adapt to diverse and severe missingness patterns.
  • Empirical results show improved accuracy and graceful degradation in tasks such as medical imaging and sensor fusion, even with >90% missing modalities.

A severe missing modality regime in multimodal learning arises when the majority of both training and testing samples lack one or more modalities, challenging canonical models that rely on complete input tuples. In this context, "SMIL"—Severely Missing modality IMbedding via meta-Learning—not only refers to the original Bayesian meta-learning framework (Ma et al., 2021), but serves as a foundational concept for a class of approaches designed to operate under extreme missingness. These approaches aim to realize flexible, robust, task-adaptive learning that remains performant as paired observations become vanishingly sparse.

1. Problem Scope and Motivation

Multimodal learning with severely missing modalities occupies a unique operational space where the ratio of modality-incomplete to modality-complete samples is exceedingly high, often >90% (Ma et al., 2021, Zhao et al., 2024). In typical real-world scenarios—medical imaging with missing MRI contrasts, sensor fusion with faulty hardware, or large-scale web data integration—complete signal coverage is unrealistic. Key desiderata for SMIL-class methods include:

  • Flexibility: Seamless handling of arbitrary missingness patterns at both train and test time, without assuming fixed or task-specific ad hoc imputation protocols.
  • Efficiency: Retention of high accuracy and generalization even as the full-modality anchor set diminishes.

Traditional imputation via autoencoders, GANs, or partial-modality distillation degrades steeply as missingness grows, primarily due to dependence on dense paired coverage or the need for strong unimodal priors (Ma et al., 2021, Zhang et al., 2024). SMIL frameworks circumvent these limitations by designing meta-learned, uncertainty-aware, and structure-exploiting representations.

2. Core SMIL Framework: Bayesian Meta-Learning and Latent-Space Reconstruction

The original SMIL approach (Ma et al., 2021) is based on a hierarchical Bayesian meta-learning paradigm. The training set is split into a small fully paired subset $\mathcal{D}^f = \{(x_i^1, x_i^2, y_i)\}_{i=1}^n$ and a substantially larger incomplete subset $\mathcal{D}^m = \{(x_j^1, y_j)\}_{j=1}^m$. The model jointly optimizes:

  • $f_\theta$: the main multimodal predictor,
  • $q_{\phi_c}(\omega|x^1)$: a feature reconstruction network for the missing modality, parameterized via a Gaussian posterior over mixture weights,
  • $q_{\phi_r}(r|h^{l-1})$: a latent regularization network acting on hidden-layer activations.

The overall generative-variational model is $p(y, z|x) = p(z)\, p(y|x, z; \theta)$ with variational approximation $q(z|x; \psi)$, where $z$ collects all latent variables. The learning goal is to maximize the ELBO: $\mathcal{L}(\theta, \psi) = \mathbb{E}_{q(z|X;\psi)}\left[\log p(Y|X, z; \theta)\right] - \mathrm{KL}(q(z|X; \psi)\,\|\,p(z))$, with $q_{\phi_c}$ reconstructing missing modalities by sampling mixture weights and projecting them onto a compact set of cluster priors (learned from the scarce complete data via K-means).
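As a rough illustration of this latent-space imputation step, the sketch below samples mixture weights from a reparameterized Gaussian posterior and projects them onto fixed cluster priors. It is plain NumPy; all weights, dimensions, and the `impute_missing_modality` helper are hypothetical stand-ins for trained components, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cluster priors: K-means centroids of modality-2 features,
# computed on the small paired subset D^f (random stand-ins here).
K, d = 4, 8
centers = rng.normal(size=(K, d))

def impute_missing_modality(h1, W_mu, W_sigma):
    """Latent-space imputation sketch: q_{phi_c}(omega | x^1) is a Gaussian
    posterior over mixture weights; the missing modality-2 feature is a
    convex combination of the cluster priors."""
    mu = h1 @ W_mu                               # posterior mean of weights
    sigma = np.log1p(np.exp(h1 @ W_sigma))       # softplus -> positive scale
    omega = mu + sigma * rng.normal(size=K)      # reparameterized sample
    omega = np.exp(omega) / np.exp(omega).sum()  # normalize mixture weights
    return omega @ centers                       # reconstructed feature (d,)

h1 = rng.normal(size=d)                  # feature of the available modality
W_mu = rng.normal(size=(d, K)) * 0.1
W_sigma = rng.normal(size=(d, K)) * 0.1
z2_hat = impute_missing_modality(h1, W_mu, W_sigma)
print(z2_hat.shape)  # (8,)
```

Working in feature space keeps the imputation target low-dimensional, which is exactly what makes this tractable compared with pixel-level completion.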

The meta-learning update schema employs a modified MAML routine. The model is updated on incomplete batches to encode learning under missingness (inner loop), then validated/regularized on the (tiny) complete set to calibrate both $\theta$ and $\psi$ (outer loop), enforcing robustness to missing-pattern shifts at deployment.
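The inner/outer update pattern can be sketched as a first-order MAML-style loop on a toy linear model. Everything here is illustrative: the data, learning rates, and squared-error loss stand in for the paper's multimodal tasks and objectives.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for the two regimes: incomplete batches (modality dropped)
# and the tiny complete anchor set.
X_inc, y_inc = rng.normal(size=(32, 5)), rng.normal(size=32)
X_full, y_full = rng.normal(size=(4, 5)), rng.normal(size=4)

def loss_and_grad(w, X, y):
    """Squared-error loss of a linear model and its gradient."""
    resid = X @ w - y
    return (resid ** 2).mean(), 2 * X.T @ resid / len(y)

w = np.zeros(5)
alpha, beta = 0.05, 0.01            # inner / outer step sizes
for _ in range(100):
    # Inner loop: adapt on an incomplete batch.
    _, g_inner = loss_and_grad(w, X_inc, y_inc)
    w_adapted = w - alpha * g_inner
    # Outer loop: evaluate the adapted weights on the tiny complete set
    # and update the initialization (first-order MAML approximation).
    _, g_outer = loss_and_grad(w_adapted, X_full, y_full)
    w = w - beta * g_outer

final_loss, _ = loss_and_grad(w, X_full, y_full)
print(final_loss)
```

The key structural point is that adaptation happens on incomplete data while the scarce complete set only steers the initialization, mirroring the calibration role it plays in SMIL.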

Uncertainty-guided regularization is imposed on hidden representations to mitigate overfitting to the incomplete regime. Perturbation variables $r$ are sampled and injected into each hidden layer: $r \sim \mathcal{N}(\mu, \sigma)$, where $(\mu, \sigma) = f_{\phi_r}(h^{l-1})$ and $h^l \leftarrow h^l \circ \mathrm{Softplus}(r)$.
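A minimal sketch of this perturbation, assuming a simple linear map in place of the regularization network $f_{\phi_r}$ (all weights here are random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(2)

def softplus(x):
    return np.log1p(np.exp(x))

def perturb_hidden(h_prev, h, W_mu, W_sigma):
    """Uncertainty-guided regularization sketch: a small network (here a
    linear map, purely illustrative) predicts (mu, sigma) from the previous
    layer's activations; a sampled perturbation r rescales the current layer
    elementwise: h^l <- h^l * Softplus(r)."""
    mu = h_prev @ W_mu
    sigma = softplus(h_prev @ W_sigma)         # keep the scale positive
    r = mu + sigma * rng.normal(size=h.shape)  # r ~ N(mu, sigma)
    return h * softplus(r)

d = 6
h_prev, h = rng.normal(size=d), rng.normal(size=d)
W_mu = rng.normal(size=(d, d)) * 0.1
W_sigma = rng.normal(size=(d, d)) * 0.1
h_new = perturb_hidden(h_prev, h, W_mu, W_sigma)
print(h_new.shape)  # (6,)
```

Because the perturbation scale is itself predicted from activations, layers that the network is uncertain about receive stronger stochastic rescaling, acting as a learned, input-dependent regularizer.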

This approach yields:

  • Latent-space imputation: reconstruction operates in low-dimensional feature space, sidestepping the ill-posedness of pixel/word-level completion.
  • Flexibility and statistical efficiency: the framework is valid across any missing pattern at train or test time, grounded in Bayesian principles rather than case-specific heuristics (Ma et al., 2021).

3. Extensions and Advanced SMIL-Style Architectures

The SMIL principle—robust meta-learned or self-supervised adaptation to missingness—forms the foundation for several advanced architectures:

3.1 Parameter-Efficient SMIL via Joint Embedding and Prompt Prediction

Recent generalizations (Kim et al., 2024) replace explicit meta-learning with joint embedding of pretrained unimodal encoders, combined with parameter-efficient fine-tuning (PEFT). For modalities $m = 1, \dots, M$, embeddings $z^m = f^m_{\mathrm{enc}}(x^m)$ are aligned via self-supervised VICReg/contrastive objectives across any available pairs. Missing modalities are predicted in representation space via a prompt-tuned predictor $f_{\mathrm{pred}}^m$, with self-attention over available features and read-only prompts.

Formally, the embedding space $Z$ is regularized so that

$$L_{\mathrm{prd}}(z^m, \hat{z}^m) = \lambda \|z^m - \hat{z}^m\|_F^2 / B + \mu\,[v(z^m) + v(\hat{z}^m)] + \nu\,[c(z^m) + c(\hat{z}^m)]$$

where $v$ and $c$ are batch variance and covariance regularizers (VICReg style).
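A NumPy sketch of this objective, assuming VICReg-style definitions of the variance and covariance terms (hinge-at-1 per-dimension standard deviation, squared off-diagonal covariance); the coefficient defaults are illustrative, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(3)

def vicreg_terms(Z):
    """Batch variance and covariance regularizers (VICReg style)."""
    std = np.sqrt(Z.var(axis=0) + 1e-4)
    v = np.mean(np.maximum(0.0, 1.0 - std))  # hinge on per-dim std
    Zc = Z - Z.mean(axis=0)
    C = (Zc.T @ Zc) / (len(Z) - 1)           # batch covariance matrix
    off_diag = C - np.diag(np.diag(C))
    c = (off_diag ** 2).sum() / Z.shape[1]   # off-diagonal penalty
    return v, c

def l_prd(z, z_hat, lam=25.0, mu=25.0, nu=1.0):
    """Prediction loss sketch matching the L_prd form above:
    invariance + variance + covariance terms."""
    B = len(z)
    invariance = lam * np.sum((z - z_hat) ** 2) / B  # ||z - z_hat||_F^2 / B
    v_z, c_z = vicreg_terms(z)
    v_h, c_h = vicreg_terms(z_hat)
    return invariance + mu * (v_z + v_h) + nu * (c_z + c_h)

z = rng.normal(size=(16, 8))                 # real modality embeddings
z_hat = z + 0.1 * rng.normal(size=(16, 8))   # predicted (imputed) embeddings
loss = l_prd(z, z_hat)
print(loss >= 0.0)
```

The variance and covariance terms keep the embedding space from collapsing, which is what lets the predictor be trained on whatever partial pairs happen to be available.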

Prompt tokens $\phi^m$, together with a small MLP, realize missing-modality synthesis with a minimal parameter budget ($<0.01\%$). Fusion uses late fusion of the (real + imputed) embeddings with their classifier heads.
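Late fusion of real and imputed embeddings can be sketched as averaging per-modality classifier logits; the weights below are random stand-ins for trained heads, and the two-modality setup is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def late_fusion_predict(embeddings, heads):
    """Late-fusion sketch: each modality embedding (real or imputed) passes
    through its own linear classifier head; logits are averaged."""
    logits = [e @ W + b for e, (W, b) in zip(embeddings, heads)]
    return np.mean(logits, axis=0)

d, n_classes = 8, 3
z_real = rng.normal(size=d)      # embedding from an observed modality
z_imputed = rng.normal(size=d)   # embedding predicted for a missing modality
heads = [(rng.normal(size=(d, n_classes)), np.zeros(n_classes))
         for _ in range(2)]
fused = late_fusion_predict([z_real, z_imputed], heads)
print(fused.shape)  # (3,)
```

Because fusion happens at the logit level, any subset of modalities (real or imputed) can be combined without retraining the fusion stage.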

This design achieves graceful degradation, outperforming backbone-finetuned baseline transformers such as ViLT by large margins when the dominant modality is missing at test time (Kim et al., 2024).

3.2 Universal Multi-Stage Masked Autoencoders and CLIP-Hypernetworks

For dense, high-dimensional tasks (e.g., 3D MRI segmentation), architectures such as multimodal masked autoencoders ($M^2$AE) combined with CLIP-driven hypernetworks (Zhao et al., 2024) extend SMIL concepts to arbitrary missing patterns and to all-stage missingness (train + test).

Training alternates between:

  • A MAE learning to reconstruct both missing modalities (entire volumes) and masked patches.
  • A distribution approximation constraint: reconstructed samples are validated with a frozen downstream segmentation network, minimizing segmentation loss as a proxy for true data likelihood.
  • Data/model co-distillation: student segmentation model is regularized by a frozen teacher evaluated on reconstructed full-modality stacks; personalization is achieved for each missing pattern by generating weights via a CLIP-text+visual-hypernetwork.

Performance is robust as the fraction of full-modality samples drops to 1%, with only minor accuracy declines; ablation confirms the contribution of each loss component (Zhao et al., 2024).

3.3 Shared-Specific Factorization, Mutual Learning, and Modality Distillation

Alternative approaches consistent with SMIL's philosophy decompose features into shared and specific components, align distributions, and generate missing features via averaging or attention (Wang et al., 2023). Dual-branch masked mutual learning with hierarchical consistency constraints (Liang et al., 10 Jul 2025) and one-stage modality distillation with joint adaptation and cross-translation (Wei et al., 2023) each provide complementary strategies for robust knowledge transfer across missing-modality scans.

4. Training, Inference, and Theoretical Foundations

All SMIL-class methods rely on explicit missing-modality simulation at train-time. This is realized by randomly dropping combinations of modalities in the input batches, ensuring that all possible patterns are seen and internalized by the optimization process (Ma et al., 2021, Zhao et al., 2024, Liang et al., 10 Jul 2025). For each incomplete sample, missing modalities are reconstructed in latent space (via meta-learned mixture, prompt-prediction, or autoencoding).
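A minimal sketch of such train-time missingness simulation, with illustrative modality names and a small probability of keeping the full tuple (so the complete pattern is also seen during training):

```python
import random

def simulate_missingness(batch, modalities, keep_full_frac=0.05, seed=None):
    """Train-time missing-modality simulation sketch: for each sample,
    keep the full tuple with small probability, otherwise drop a random
    proper, non-empty subset of modalities (replaced by None)."""
    rng = random.Random(seed)
    out = []
    for sample in batch:
        if rng.random() < keep_full_frac:
            out.append(dict(sample))  # keep all modalities intact
            continue
        k = rng.randint(1, len(modalities) - 1)   # how many to keep
        kept = set(rng.sample(modalities, k))
        out.append({m: (v if m in kept else None)
                    for m, v in sample.items()})
    return out

batch = [{"image": [0.1], "text": [0.2], "audio": [0.3]} for _ in range(8)]
dropped = simulate_missingness(batch, ["image", "text", "audio"], seed=0)
# every sample retains at least one modality by construction
print(all(any(v is not None for v in s.values()) for s in dropped))
```

Sampling over all proper subsets ensures the optimizer encounters every missingness pattern it may face at deployment, which is the premise the universal-inference property rests on.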

Inference is universal: given any subset of modalities, the model seamlessly generates required embeddings and proceeds with prediction via the original task head. Theoretical grounding is often provided by:

  • Variational bounds (ELBO): guaranteeing that the mixture of incomplete and complete data is handled under a coherent probabilistic model.
  • Maximum likelihood estimation: joint models parameterized to optimize the likelihood over all observed label/modality pairs, including closed-form marginalization for missing modalities (Ma et al., 2021).

Notably, approaches such as the generalized softmax (Ma et al., 2021) introduce closed-form solutions for $Q_{Z|X}$ via empirical marginalization over observed $Y$, enabling efficient learning at up to 95% missing rates.

5. Empirical Performance and Benchmarks

Quantitative evaluation affirms that SMIL-style methods are state-of-the-art across a spectrum of settings:

  • MM-IMDb, CMU-MOSI, avMNIST: SMIL achieves F1 improvements of 3–6 points over autoencoder and GAN baselines at 90% missing (Ma et al., 2021).
  • BraTS 2018/2020: Autoencoding and co-distilled models outperform M³AE, meta-learning, and RFNet by 5–20 Dice points at 1% full-modality rates (Zhao et al., 2024).
  • Visual/Textual Classification: Prompt-based SMIL yields 13-point F1 gains (MM-IMDb, 90% missing) over ViLT; classification accuracy degrades slowly even as the dominant modality is dropped (Kim et al., 2024, Lee et al., 2023).
  • Brain tumor segmentation: Dual-branch, mutual learning approaches exceed SOTA by 1.3–1.7% mean DSC, robust to all (1–3)-modality missing scenarios (Liang et al., 10 Jul 2025).

These methods consistently exhibit:

  • Graceful performance degradation under heavy missingness.
  • Robust separation and clustering of latent classes (visualized via t-SNE).
  • Effective utilization of the incomplete data universe, regularizing against overfitting to the scarce full-modality subset.

6. Limitations, Open Questions, and Future Directions

Current SMIL-style solutions still exhibit notable limitations:

  • Computational overhead: Bayesian meta-learning and Monte Carlo sampling incur substantial training cost, particularly on high-dimensional modalities (Ma et al., 2021).
  • Need for a small complete set: For cluster prior formation, joint-embedding calibration, or marginalization, a nonzero anchor of paired samples is essential (Ma et al., 2021, Kim et al., 2024).
  • Non-random missingness: Most approaches implicitly assume missing at random. Structured, adversarial, or causal missingness patterns present unsolved challenges.
  • Scalability: As the number of modalities $M$ increases, the complexity of prompt configurations, meta-learned generators, and fused predictors scales superlinearly (Kim et al., 2024, Lee et al., 2023).

Potential future extensions include:

  • End-to-end adaptive clustering for feature priors.
  • Hierarchical or shared predictors for high-modality fusion.
  • Causal-inference integration for systematic missingness.
  • Lightweight generative priors fused with prompt conditioning.

SMIL's philosophy sharply contrasts with:

  • Reconstruction via autoencoders/GANs: These are brittle when full-modal data is scarce or imputation is high-dimensional (Ma et al., 2021).
  • Attention/fusion or distribution-alignment only models: While effective under moderate missingness, they lack the explicit adaptation, imputation regularization, and meta-learned prior structure essential for extreme regimes (Wang et al., 2023, Wang et al., 2023).
  • Maximum-likelihood/energy-based models: Direct modeling of conditional likelihoods over merged complete and incomplete data provides strong theoretical guarantees and superior performance up to severe missingness regimes (Ma et al., 2021).
  • Prompt tuning with minimal finetuning: Prompt-based approaches address resource efficiency and backbone stability; when combined with joint-embedding, they approximate optimal SMIL performance at low computational cost (Kim et al., 2024, Lee et al., 2023).

A plausible implication is that future state-of-the-art under severe missingness will integrate meta-learned latent reconstruction, joint-embedding or co-distillation, distributional alignment, parameter-efficient prompt and hypernetwork adaptation, and theoretical likelihood optimization into a unified, robust, extensible system.


References:

  • (Ma et al., 2021) SMIL: Multimodal Learning with Severely Missing Modality
  • (Zhao et al., 2024) Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization
  • (Kim et al., 2024) Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models
  • (Wang et al., 2023) Multi-modal Learning with Missing Modality via Shared-Specific Feature Modelling
  • (Ma et al., 2021) Maximum Likelihood Estimation for Multimodal Learning with Missing Modality
  • (Lee et al., 2023) Multimodal Prompting with Missing Modalities for Visual Recognition
  • (Liang et al., 10 Jul 2025) Semantic-guided Masked Mutual Learning for Multi-modal Brain Tumor Segmentation with Arbitrary Missing Modalities
  • (Wang et al., 2023) Learnable Cross-modal Knowledge Distillation for Multi-modal Learning with Missing Modality
