MultiMAE: Multi-modal Masked Autoencoders

Updated 6 May 2026
  • MultiMAE is a multi-modal masked autoencoder that extends conventional MAEs by reconstructing masked patches from diverse input types.
  • It employs modality-specific patch embedding and independent decoder heads to leverage cross-modal cues and handle missing modalities seamlessly.
  • Demonstrations in Earth observation, medical imaging, and surgical vision show state-of-the-art performance in classification, segmentation, and imputation tasks.

MultiMAE is a multi-modal, multi-task extension of masked autoencoders (MAEs), designed to enable foundation models to learn cross-modal predictive representations from diverse types of input data while efficiently addressing tasks such as classification, segmentation, and geometric inference. Unlike standard MAEs that operate on unimodal (typically RGB) inputs, MultiMAE simultaneously reconstructs masked patches from multiple input modalities, supporting flexibility and robustness in downstream tasks where only partial modalities may be present.

1. Architectural Principles

MultiMAE generalizes masked autoencoding to the multi-modal, multi-task regime by introducing two key design components: (1) modality-specific patch embedding layers, and (2) independent decoder heads per modality or task. The central idea is to expose a shared transformer encoder to a sparse, masked set of tokens drawn from heterogeneous input sources (e.g., RGB, depth, semantic, spectral bands), forcing the model to leverage cross-modal cues for reconstructing the missing content.

Encoder and Input Handling

  • Each modality $m$ (e.g., RGB, depth, segmentation, multi-spectral bands) is divided into non-overlapping spatial patches, producing $N_m$ patches per instance.
  • A modality-specific linear projection $E_m: \mathbb{R}^{P^2 C_m} \rightarrow \mathbb{R}^D$ maps each patch to a shared $D$-dimensional space (typically $D=768$ for ViT-B).
  • Discrete sine-cosine positional embeddings are added to each token, together with explicit modality embeddings tagging token origin.
  • Only a subset of patches is retained as visible; the visible subset across all modalities is determined jointly via a Dirichlet allocation, so that every modality is exposed to the encoder over the course of training.
  • The shared ViT-based encoder operates on the concatenated set of visible tokens (plus a [CLS] token if used), mixing information from all present modalities in a late-fusion manner.
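The tokenization steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: images are nested lists of shape H x W x C, `make_projection` is a hypothetical stand-in for the learned projection $E_m$ (random weights, no bias), and positional and modality embeddings are left out for brevity. All function names are illustrative.

```python
import random

def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patches of length P*P*C."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values
            patches.append(flat)
    return patches

def make_projection(in_dim, out_dim, rng):
    """Hypothetical stand-in for a learned linear layer E_m (weights only)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]

def embed(patches, weight):
    """Project every flattened patch into the shared D-dimensional space."""
    return [[sum(w * v for w, v in zip(row, p)) for row in weight] for p in patches]

def tokenize(inputs, patch, dim, seed=0):
    """Build one token list covering all modalities, tagged by origin and position."""
    rng = random.Random(seed)
    tokens = []
    for name, image in inputs.items():
        patches = patchify(image, patch)
        weight = make_projection(len(patches[0]), dim, rng)  # one projection per modality
        for index, vector in enumerate(embed(patches, weight)):
            tokens.append({"modality": name, "position": index, "vector": vector})
    return tokens

def constant_image(height, width, channels, value):
    return [[[value] * channels for _ in range(width)] for _ in range(height)]

inputs = {
    "rgb": constant_image(8, 8, 3, 0.5),    # C_m = 3
    "depth": constant_image(8, 8, 1, 1.0),  # C_m = 1
}
tokens = tokenize(inputs, patch=4, dim=16)
```

With 8 x 8 inputs and $P = 4$, each modality contributes four tokens. The RGB patches have $P^2 C_m = 48$ entries and the depth patches 16, yet both land in the same 16-dimensional token space, which is what lets a single encoder consume them together.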

Per-Modality Decoders

  • For each modality or output type, a lightweight transformer-based decoder is instantiated (typically depth 2–3, hidden size 256–512).
  • The decoder receives both learnable mask tokens (for each masked patch in the modality) and access to the full encoded sequence via a cross-attention block.
  • Reconstruction is performed only on masked positions, projecting back from decoder output to patch space.
  • All decoder losses are aggregated equally in the overall loss.
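A rough sketch of the decoder's cross-attention step, assuming a single attention head with no learned query, key, or value projections, and omitting the MLP layers and the final projection back to patch space. The names `decode_masked`, `mask_token`, and `pos_embed` are illustrative, not taken from any of the cited implementations.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Single-head scaled dot-product attention; `context` serves as keys and values."""
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return outputs

def decode_masked(encoded, masked_positions, mask_token, pos_embed):
    """Form one query per masked patch (mask token + position) and attend to the encoder output."""
    queries = [
        [m + p for m, p in zip(mask_token, pos_embed[i])] for i in masked_positions
    ]
    return dict(zip(masked_positions, cross_attention(queries, encoded)))

# Two encoded (visible) tokens and two masked positions with toy position embeddings.
encoded = [[1.0, 0.0], [0.0, 1.0]]
pos_embed = {2: [0.0, 0.0], 3: [10.0, 0.0]}
decoded = decode_masked(encoded, [2, 3], mask_token=[0.0, 0.0], pos_embed=pos_embed)
```

In this toy setup the query at position 2 is all zeros and attends uniformly, while the query at position 3 aligns with the first encoded token and draws almost entirely from it. Only the masked positions appear in the output, matching the point that reconstruction is restricted to them.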

The same structure appears in the original computer vision work (Bachmann et al., 2022) and carries over to adaptations for Earth Observation (Sosa et al., 20 May 2025), brain MRI (Erdur et al., 14 Sep 2025), and surgical vision (Han et al., 26 Jan 2026).

2. Masking Strategy and Multi-Modal Mask Sampling

Masking is fundamental to MultiMAE. The masking scheme is extended from MAE to simultaneously mask across $M$ modalities, maintaining a high global masking ratio (typically $r \approx 0.75$–$0.83$):

  • A symmetric Dirichlet distribution ($\alpha = 1$ for each modality) determines the proportions $q$ of visible patches allocated per modality in each sample.
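  • The number of visible tokens for modality $m$ is $n_m = \lfloor q_m \, (1 - r) \sum_{m'} N_{m'} \rfloor$, i.e., its share $q_m$ of the global visible-token budget.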
  • Masked positions are replaced in the decoder input by learnable mask tokens.
  • This approach ensures coverage of all masking regimes—including scenarios where some modalities are almost entirely masked—forcing the model to learn cross-modal imputation.
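The sampling procedure in the list above might look roughly like the following. This is a sketch under assumptions: the Dirichlet draw is obtained by normalizing Gamma variates, the per-modality count is floored and capped at the number of available patches, and the function names are illustrative rather than drawn from any reference code.

```python
import random

def sample_dirichlet(num_modalities, alpha, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(alpha, 1) samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_modalities)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_visible(patch_counts, mask_ratio=0.75, alpha=1.0, seed=None):
    """Pick the visible patch indices for each modality.

    patch_counts maps a modality name to its number of patches N_m.
    Flooring and capping at N_m are assumptions of this sketch, so the
    total may fall slightly below the nominal visible-token budget.
    """
    rng = random.Random(seed)
    names = list(patch_counts)
    budget = round((1.0 - mask_ratio) * sum(patch_counts.values()))
    proportions = sample_dirichlet(len(names), alpha, rng)
    visible = {}
    for name, q in zip(names, proportions):
        count = min(int(q * budget), patch_counts[name])
        visible[name] = sorted(rng.sample(range(patch_counts[name]), count))
    return visible

patch_counts = {"rgb": 196, "depth": 196, "semseg": 196}
visible = sample_visible(patch_counts, mask_ratio=0.75, alpha=1.0, seed=0)
num_visible = {name: len(indices) for name, indices in visible.items()}
```

With $\alpha = 1$ the proportions are uniform over the simplex, so across samples a given modality ranges from nearly fully visible to nearly fully masked. That spread is what exposes the model to the cross-modal imputation regimes described above.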

For 3D MRI, the masking operates independently per modality per patch, but the average coverage maintains the targeted global mask rate (Erdur et al., 14 Sep 2025).

3. Training Objectives and Loss Functions

The MultiMAE pre-training objective is a joint, per-modality masked reconstruction loss restricted to masked positions:

$$\mathcal{L} = \sum_{m} \frac{1}{\lVert \mathcal{M}_m \rVert_1} \left\lVert \mathcal{M}_m \odot \big( D_m(Z) - X_m \big) \right\rVert_2^2$$

where $D_m$ denotes the modality-specific decoder, $\mathcal{M}_m$ the binary mask on the target modality, $Z$ the encoder outputs, and $X_m$ the ground-truth data for modality $m$. The loss is generally averaged uniformly over the masked patches in every modality.
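As a worked illustration of the objective, the sketch below computes a masked reconstruction loss per modality and sums the results. It assumes a squared-error reconstruction term and a mask convention where 1 marks a masked (to-be-reconstructed) patch; `masked_squared_error` and `multimae_loss` are hypothetical names, and the decoder outputs are supplied directly as `predictions`.

```python
def masked_squared_error(prediction, target, mask):
    """Squared error summed within each patch, averaged over masked patches only."""
    masked = [i for i, flag in enumerate(mask) if flag]
    if not masked:
        return 0.0  # nothing to reconstruct for this modality
    total = 0.0
    for i in masked:
        total += sum((p - t) ** 2 for p, t in zip(prediction[i], target[i]))
    return total / len(masked)

def multimae_loss(predictions, targets, masks):
    """Unweighted sum of the per-modality masked losses."""
    return sum(
        masked_squared_error(predictions[m], targets[m], masks[m])
        for m in predictions
    )

# Two patches per modality; the second RGB patch is visible, so its error is ignored.
predictions = {"rgb": [[1.0, 2.0], [0.0, 0.0]], "depth": [[0.5], [1.5]]}
targets = {"rgb": [[1.0, 0.0], [5.0, 5.0]], "depth": [[0.0], [1.0]]}
masks = {"rgb": [1, 0], "depth": [1, 1]}

loss = multimae_loss(predictions, targets, masks)  # 4.0 (rgb) + 0.25 (depth) = 4.25
```

The visible RGB patch has a large error against its target, but it contributes nothing because only masked positions enter the loss.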

In applications with pseudo-modality labels (e.g., computer vision on ImageNet-1K), loss functions are individualized:

  • Depth: $L_1$ loss on pseudo-depth maps.
  • Segmentation: cross-entropy over class logits.
  • RGB: pixel-wise regression loss (e.g., MSE).

Losses are not weighted across modalities unless otherwise specified.
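The per-modality loss choices can be expressed as a small dispatch table. The sketch assumes an L1 loss for depth, cross-entropy for segmentation, and MSE for RGB, applied here to a single patch or pixel for readability; the `LOSSES` table and the function names are illustrative.

```python
import math

def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def mse_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cross_entropy(logits, label):
    """Negative log-softmax of the true class, using a stable log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]

# Assumed pairing of modality to loss for this sketch.
LOSSES = {"depth": l1_loss, "semseg": cross_entropy, "rgb": mse_loss}

def total_loss(outputs, targets):
    """Unweighted sum across modalities, matching the equal aggregation described above."""
    return sum(LOSSES[m](outputs[m], targets[m]) for m in outputs)

outputs = {"depth": [1.0, 3.0], "rgb": [1.0, 3.0], "semseg": [0.0, 0.0]}
targets = {"depth": [0.0, 1.0], "rgb": [0.0, 1.0], "semseg": 0}
loss = total_loss(outputs, targets)  # 1.5 + 2.5 + ln(2)
```

Because the three terms live on different scales, an unweighted sum implicitly lets the largest term dominate the gradient; per-modality weights would be the natural knob if that became a problem.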

4. Adaptations to Diverse Domains

Earth Observation (EO)

In EO, MultiMAE processes up to six modalities: four spectral bands (RGB, IRED, SIRED, EB), DEM elevation data, and land-cover segmentation. The encoder and decoders are matched to modality dimensions and channel counts. Experiments demonstrate that MultiMAE achieves state-of-the-art performance for both classification (e.g., 97.3% with fine-tuning (FT) on m-eurosat, vs. a prior best of 92.2%) and segmentation (e.g., 81.99% mIoU on m-cashew, exceeding ConvNeXt and ViT-B baselines) (Sosa et al., 20 May 2025). Ablation studies show that the benefits of multi-modal pre-training persist even when fine-tuning uses fewer modalities.

Medical Imaging (Brain MRI)

Adapting MultiMAE to 3D MRI involves dividing each input sequence (T1, T1c, T2, FLAIR) into non-overlapping 16³ patches and using independent patch embedding layers. The late-fusion encoder fuses token representations from all available modalities. Modality-specific decoders reconstruct masked patches. MultiMAE outperforms an early-fusion ViT-MAE baseline by a large margin on segmentation Dice (Δ=+10.1) and classification MCC (Δ=+0.46) in missing-input regimes. Omission of a modality at inference simply entails skipping its tokens and decoder; the encoder adapts seamlessly (Erdur et al., 14 Sep 2025).

Computer Vision Transfer

The original MultiMAE is pre-trained on RGB, depth, and segmentation (with pseudo-labels) sampled from generic image datasets. The encoder and masking protocol are fully general, permitting transfer with any subset of modalities at downstream time. MultiMAE delivers state-of-the-art results on multi-task semantic segmentation and depth estimation: e.g., NYUv2 segmentation mIoU 52.0 (vs. 50.8 for MAE), and consistent improvements when real or pseudo-depth is incorporated (Bachmann et al., 2022).

Surgical Vision

In surgical scene understanding, MultiMAE operates over RGB–depth pairs, with explicit geometric tokenization for depth. Pre-training on 1.4M surgical frames with synthetic depth yields consistent gains on downstream detection, segmentation, pose, and depth tasks, with improvements of +20–87% on standard metrics compared to unimodal MAE. Notably, depth is required only for pre-training; at inference, the model can be deployed on RGB images alone (Han et al., 26 Jan 2026).

5. Transfer Protocols and Downstream Adaptation

MultiMAE supports both frozen-encoder (linear probing) and end-to-end fine-tuning transfer. For classification:

  • Linear probes are trained on top of the (frozen) encoder.
  • Full fine-tuning updates the encoder and the task-specific head.

For segmentation:

  • A segmentation network (e.g., ConvNeXt block or UNETR) is attached, potentially using multi-layer skip connections.
  • For 3D inputs, spatial grid locations with missing modalities are filled by averaging tokens across present modalities.
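One possible reading of the averaging rule for 3D inputs is sketched below: at each spatial grid location, the token of a missing modality is replaced by the mean of the tokens from the modalities that are present. The function name and the dictionary layout are assumptions made for illustration, not the published implementation.

```python
def fill_missing_modalities(tokens, expected):
    """Return a token grid for every expected modality.

    tokens maps a present modality name to a list of per-location vectors.
    expected lists every modality the segmentation head was built for.
    Missing modalities receive the per-location mean over present ones.
    """
    present = [name for name in expected if name in tokens]
    if not present:
        raise ValueError("at least one modality must be present")
    num_locations = len(tokens[present[0]])
    dim = len(tokens[present[0]][0])
    averaged = []
    for loc in range(num_locations):
        averaged.append([
            sum(tokens[name][loc][j] for name in present) / len(present)
            for j in range(dim)
        ])
    return {name: tokens[name] if name in tokens else averaged for name in expected}

# Two grid locations, 2-dimensional tokens; T1c and T2 are unavailable.
tokens = {
    "t1": [[1.0, 2.0], [3.0, 4.0]],
    "flair": [[3.0, 4.0], [5.0, 6.0]],
}
filled = fill_missing_modalities(tokens, expected=["t1", "t1c", "t2", "flair"])
```

The segmentation head then sees a grid of the usual shape regardless of which sequences were acquired, while the encoder itself only ever processed the tokens that actually existed.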

MultiMAE's encoder is agnostic to missing modalities; missing tokens are simply omitted at both pre-training and downstream time. This property contrasts with early-fusion models that require a fixed set of input channels, which are typically filled with zeros for missing modalities and exhibit degraded performance as a result.

6. Quantitative Benchmarks and Ablation Insights

Extensive experiments demonstrate that MultiMAE consistently outperforms unimodal or early-fusion baselines:

| Domain | Task | SOTA Baseline | MultiMAE Result | Reference |
|---|---|---|---|---|
| Earth Obs. | Classif. (m-eurosat, FT) | DOFA (92.2%) | 97.3% | (Sosa et al., 20 May 2025) |
| Earth Obs. | Segm. (m-cashew, FT) | ConvNeXt (75.9%) | 81.99% mIoU | (Sosa et al., 20 May 2025) |
| Brain MRI | Segm. (Dice, missing inputs) | ViT-MAE (avg.) | Δ=+10.1 | (Erdur et al., 14 Sep 2025) |
| Surgical Vision | Segm. (EndoVis18, FT) | MAE (23.5) | 43.9 mIoU | (Han et al., 26 Jan 2026) |
| Computer Vision | NYUv2 mIoU (RGB, FT) | MAE (50.8) | 52.0 | (Bachmann et al., 2022) |
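To make the margins in these benchmarks easier to compare, the short script below derives absolute and relative gains from the reported baseline and MultiMAE scores. The numbers are copied from the rows above; the brain MRI row is left out because only the difference is reported there. The `gains` helper is illustrative.

```python
# (task, baseline score, MultiMAE score), all in the units reported above
results = [
    ("Earth Obs. classification (m-eurosat, FT)", 92.2, 97.3),
    ("Earth Obs. segmentation (m-cashew, FT)", 75.9, 81.99),
    ("Surgical segmentation (EndoVis18, FT)", 23.5, 43.9),
    ("NYUv2 segmentation (RGB, FT)", 50.8, 52.0),
]

def gains(baseline, score):
    """Return (absolute gain in points, relative gain in percent of baseline)."""
    absolute = score - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)

summary = {task: gains(baseline, score) for task, baseline, score in results}

for task, (absolute, relative) in summary.items():
    print(f"{task}: +{absolute} points ({relative}% relative)")
```

The absolute gains range from about 1 point on NYUv2 to about 20 points on EndoVis18, so the size of the benefit varies considerably by domain and baseline strength.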

Ablation studies in each domain highlight:

  • Multi-modal pre-training yields benefits retained even when downstream fine-tuning uses only a subset of modalities.
  • Masking with randomly sampled proportions per modality (Dirichlet allocation) is critical to cross-modal predictive coding.
  • Data efficiency: MultiMAE fine-tuned with substantially less labeled data matches or exceeds unimodal models trained at full scale (Han et al., 26 Jan 2026).

7. Implications and Future Directions

MultiMAE’s modular encoder–decoder design and masking protocol provide significant flexibility for both research and practical deployment. Key implications include:

  • Modality Flexibility: The shared encoder accommodates arbitrary subsets of input modalities without retraining or architectural changes, facilitating deployment in diverse data circumstances.
  • Inference Simplicity: Modalities used only for pre-training (e.g., depth) can be omitted at test time, incurring no inference cost.
  • Cross-Modal Predictive Coding: The model learns representations that bridge modalities, which is crucial when some modalities are absent or require imputation.
  • Extension Potential: There is scope to include further modalities (SAR, text, temporal), alternative masking strategies, and to unify contrastive and masked modeling objectives.
  • Domain-Specific Pre-training: Using domain-matched pseudo-modalities for pre-training improves downstream performance robustness.

A plausible implication is that MultiMAE frameworks represent a scalable approach to robust, generalist foundation models in vision, remote sensing, and medical domains, particularly in environments with heterogeneous and incomplete data streams.


References:

  • (Bachmann et al., 2022): "MultiMAE: Multi-modal Multi-task Masked Autoencoders"
  • (Sosa et al., 20 May 2025): "MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks"
  • (Erdur et al., 14 Sep 2025): "MultiMAE for Brain MRIs: Robustness to Missing Inputs Using Multi-Modal Masked Autoencoder"
  • (Han et al., 26 Jan 2026): "On the Role of Depth in Surgical Vision Foundation Models: An Empirical Study of RGB-D Pre-training"
