MultiMAE: Multi-modal Masked Autoencoders

Updated 6 May 2026

MultiMAE is a multi-modal masked autoencoder that extends conventional MAEs by reconstructing masked patches from diverse input types.
It employs modality-specific patch embedding and independent decoder heads to leverage cross-modal cues and handle missing modalities seamlessly.
Demonstrations in Earth observation, medical imaging, and surgical vision show state-of-the-art performance in classification, segmentation, and imputation tasks.

MultiMAE is a multi-modal, multi-task extension of masked autoencoders (MAEs), designed to enable foundation models to learn cross-modal predictive representations from diverse types of input data while efficiently addressing tasks such as classification, segmentation, and geometric inference. Unlike standard MAEs that operate on unimodal (typically RGB) inputs, MultiMAE simultaneously reconstructs masked patches from multiple input modalities, supporting flexibility and robustness in downstream tasks where only partial modalities may be present.

1. Architectural Principles

MultiMAE generalizes masked autoencoding to the multi-modal, multi-task regime by introducing two key design components: (1) modality-specific patch embedding layers, and (2) independent decoder heads per modality or task. The central idea is to expose a shared transformer encoder to a sparse, masked set of tokens drawn from heterogeneous input sources (e.g., RGB, depth, semantic, spectral bands), forcing the model to leverage cross-modal cues for reconstructing the missing content.

Encoder and Input Handling

Each modality $m$ (e.g., RGB, depth, segmentation, multi-spectral bands) is divided into non-overlapping spatial patches, producing $N_m$ patches per instance.
A modality-specific linear projection $E_m: \mathbb{R}^{P^2C_m} \rightarrow \mathbb{R}^D$ maps each patch to a shared $D$-dimensional space (typically $D=768$ for ViT-B), where $P$ is the patch side length and $C_m$ the channel count of modality $m$.
Discrete sine-cosine positional embeddings are added to each token, together with explicit modality embeddings tagging token origin.
Only a subset of patches is retained (visible); the visible subset across all modalities is determined jointly via a Dirichlet allocation to ensure balanced cross-modal exposure.
The shared ViT-based encoder operates on the concatenated set of visible tokens (plus a [CLS] token if used), mixing information from all present modalities in a late-fusion manner.
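
As a rough illustration of the patching step described above, the sketch below splits a small multi-channel image into non-overlapping $P \times P$ patches and flattens each one into a vector of length $P^2 C_m$, which is the input size of the projection $E_m$. The helper names `patchify` and `make_image` and the toy sizes are assumptions made for this example and are not taken from any MultiMAE codebase; the learned projection and the embeddings are left out.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches.

    Returns N = (H // P) * (W // P) vectors, each of length P * P * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            vec = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    vec.extend(image[y][x])  # append all C channel values of this pixel
            patches.append(vec)
    return patches


def make_image(height, width, channels):
    """Toy image whose values encode their (y, x, c) position."""
    return [[[y * 100 + x * 10 + c for c in range(channels)]
             for x in range(width)] for y in range(height)]


rgb_patches = patchify(make_image(4, 4, 3), patch_size=2)    # C_m = 3
depth_patches = patchify(make_image(4, 4, 1), patch_size=2)  # C_m = 1
print(len(rgb_patches), len(rgb_patches[0]))      # 4 12
print(len(depth_patches), len(depth_patches[0]))  # 4 4
```

Both modalities yield the same number of patches here but different patch vector lengths, which is why each modality needs its own projection into the shared $D$-dimensional space.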

Per-Modality Decoders

For each modality or output type, a lightweight transformer-based decoder is instantiated (typically depth 2–3, hidden size 256–512).
The decoder receives both learnable mask tokens (for each masked patch in the modality) and access to the full encoded sequence via a cross-attention block.
Reconstruction is performed only on masked positions, projecting back from decoder output to patch space.
All decoder losses are aggregated equally in the overall loss.

This structure is consistent in adaptations to Earth Observation (Sosa et al., 20 May 2025), brain MRI (Erdur et al., 14 Sep 2025), computer vision transfer (Bachmann et al., 2022), and surgical vision (Han et al., 26 Jan 2026).

2. Masking Strategy and Multi-Modal Mask Sampling

Masking is fundamental to MultiMAE. The masking scheme extends that of MAE to mask across all $M$ modalities simultaneously, maintaining a high global masking ratio (typically $r \approx 0.75$–$0.83$):

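A symmetric Dirichlet distribution ($\alpha=1$ for each modality) determines the proportions $q = (q_1, \dots, q_M)$ of visible patches allocated per modality in each sample.
The number of visible tokens for modality $m$ is $n_m = \lfloor q_m N_{\text{vis}} \rfloor$, where $N_{\text{vis}} = (1-r)\sum_{m'} N_{m'}$ is the global visible-token budget.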
Masked positions are replaced in the decoder input by learnable mask tokens.
This approach ensures coverage of all masking regimes—including scenarios where some modalities are almost entirely masked—forcing the model to learn cross-modal imputation.
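
The sampling procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation: the function names, the rounding of the visible budget, and the round-robin handling of leftover tokens are choices made for this example. The Dirichlet sample is drawn by normalising independent Gamma draws, which is a standard construction.

```python
import random


def sample_visible_counts(tokens_per_modality, mask_ratio=0.75, alpha=1.0, rng=None):
    """Return a dict name -> number of visible tokens kept for that modality."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    if rng is None:
        rng = random.Random()
    names = list(tokens_per_modality)
    # Global visible budget: (1 - r) times the total number of tokens.
    budget = round((1.0 - mask_ratio) * sum(tokens_per_modality.values()))

    # Symmetric Dirichlet sample via normalised Gamma(alpha, 1) draws.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in names]
    norm = sum(gammas)
    counts = {}
    for name, g in zip(names, gammas):
        share = g / norm
        counts[name] = min(int(share * budget), tokens_per_modality[name])

    # Flooring (and capping) can leave a few tokens unassigned; hand them out
    # round-robin to modalities that still have room (an assumed tie-break).
    leftover = budget - sum(counts.values())
    while leftover > 0:
        for name in names:
            if leftover > 0 and counts[name] < tokens_per_modality[name]:
                counts[name] += 1
                leftover -= 1
    return counts


def sample_visible_indices(tokens_per_modality, counts, rng):
    """Pick which patch positions stay visible in each modality."""
    return {name: sorted(rng.sample(range(n), counts[name]))
            for name, n in tokens_per_modality.items()}


rng = random.Random(0)
sizes = {"rgb": 196, "depth": 196, "semseg": 196}
counts = sample_visible_counts(sizes, mask_ratio=0.75, rng=rng)
visible = sample_visible_indices(sizes, counts, rng)
print(counts, sum(counts.values()))  # the counts sum to the 147-token budget
```

Because $\alpha=1$ makes every split of the budget equally likely, some samples give one modality almost all visible tokens and another almost none, which is the regime that forces cross-modal prediction.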

For 3D MRI, the masking operates independently per modality per patch, but the average coverage maintains the targeted global mask rate (Erdur et al., 14 Sep 2025).

3. Training Objectives and Loss Functions

The MultiMAE pre-training objective is a joint, per-modality masked reconstruction loss restricted to masked positions:

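$$\mathcal{L} = \sum_{m=1}^{M} \frac{1}{\lVert \mathcal{M}_m \rVert_1} \sum_{i} \mathcal{M}_{m,i}\, \ell_m\big(D_m(z)_i,\; x_{m,i}\big)$$

where $D_m$ denotes the modality-specific decoder, $\mathcal{M}_m$ the binary mask on the target modality (1 at masked patches), $z$ the encoder outputs, $x_m$ the ground-truth data for modality $m$, and $\ell_m$ the per-patch reconstruction loss. The loss is generally averaged uniformly over the masked patches in every modality.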
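
To make the restriction to masked positions concrete, here is a small sketch that computes a masked reconstruction loss per modality and sums the results with equal weight. It assumes mean squared error as the per-patch loss for every modality and uses plain Python lists in place of tensors; the function names are hypothetical.

```python
def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches (each a list of floats).
    mask: list of 0/1 flags, 1 = patch was masked and must be reconstructed.
    """
    n_masked = sum(mask)
    if n_masked == 0:
        return 0.0  # nothing to reconstruct for this modality
    total = 0.0
    for p, t, m in zip(pred, target, mask):
        if m:  # visible patches contribute nothing to the loss
            total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
    return total / n_masked


def multimodal_loss(preds, targets, masks):
    """Unweighted sum of the per-modality masked losses."""
    return sum(masked_mse(preds[m], targets[m], masks[m]) for m in preds)


preds = {"rgb": [[0.0, 0.0], [1.0, 1.0]], "depth": [[0.5], [0.0]]}
targets = {"rgb": [[1.0, 1.0], [1.0, 3.0]], "depth": [[0.5], [2.0]]}
masks = {"rgb": [0, 1], "depth": [1, 1]}
print(multimodal_loss(preds, targets, masks))  # 2.0 (rgb) + 2.0 (depth) = 4.0
```

Note that the first RGB patch is predicted badly but is visible, so it does not affect the result.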

In applications with pseudo-modality labels (e.g., computer vision on ImageNet-1K), loss functions are individualized:

Depth: $L_1$ loss on pseudo-depth maps.
Segmentation: cross-entropy over class logits.
RGB: $L_2$ loss or MSE.

Losses are not weighted across modalities unless otherwise specified.
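
A possible way to organise such individualized losses is a lookup table from modality name to per-patch loss function, as sketched below. The table `PATCH_LOSS`, the choice of MSE for RGB, and the simplification of one class label per segmentation patch are assumptions for illustration only.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error over the values of one patch."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse_loss(pred, target):
    """Mean squared error over the values of one patch."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits, class_index):
    """Cross-entropy of one patch's class logits against an integer label."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[class_index]


# Assumed mapping from modality to per-patch loss.
PATCH_LOSS = {"rgb": mse_loss, "depth": l1_loss, "semseg": cross_entropy}


def total_loss(preds, targets, masks):
    """Equal-weight sum of per-modality losses, each averaged over masked patches."""
    total = 0.0
    for name, loss_fn in PATCH_LOSS.items():
        masked = [i for i, m in enumerate(masks[name]) if m]
        if masked:
            per_patch = [loss_fn(preds[name][i], targets[name][i]) for i in masked]
            total += sum(per_patch) / len(masked)
    return total


preds = {"rgb": [[0.0, 2.0]], "depth": [[1.0, 3.0]], "semseg": [[0.0, 0.0]]}
targets = {"rgb": [[1.0, 1.0]], "depth": [[0.0, 1.0]], "semseg": [1]}
masks = {"rgb": [1], "depth": [1], "semseg": [1]}
print(total_loss(preds, targets, masks))  # 1.0 + 1.5 + ln(2)
```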

4. Adaptations to Diverse Domains

Earth Observation (EO)

In EO, MultiMAE processes up to six modalities: four spectral bands (RGB, IRED, SIRED, EB), digital elevation model (DEM) data, and land-cover segmentation. The encoder and decoders are matched to modality dimensions and channel counts. Experiments demonstrate that MultiMAE achieves state-of-the-art performance for both classification (e.g., 97.3% with fine-tuning (FT) on m-eurosat, vs. a prior best of 92.2%) and segmentation (e.g., 81.99% mIoU, exceeding ConvNeXt and ViT-B baselines) (Sosa et al., 20 May 2025). Ablation studies reveal persistent benefits from multi-modal pre-training even if fine-tuning occurs with fewer modalities.

Medical Imaging (Brain MRI)

Adapting MultiMAE to 3D MRI involves dividing each input sequence (T1, T1c, T2, FLAIR) into non-overlapping 16³ patches and using independent patch embedding layers. The late-fusion encoder fuses token representations from all available modalities. Modality-specific decoders reconstruct masked patches. MultiMAE outperforms an early-fusion ViT-MAE baseline by a large margin on segmentation Dice score (Δ=+10.1) and classification Matthews correlation coefficient, MCC (Δ=+0.46), in missing-input regimes. Omitting a modality at inference simply entails skipping its tokens and decoder; the encoder adapts seamlessly (Erdur et al., 14 Sep 2025).

Computer Vision Transfer

The original MultiMAE is pre-trained on RGB, depth, and segmentation (with pseudo-labels) sampled from generic image datasets. The encoder and masking protocol are fully general, permitting transfer with any subset of modalities at downstream time. MultiMAE delivers state-of-the-art results on multi-task semantic segmentation and depth estimation: e.g., NYUv2 segmentation mIoU 52.0 (vs. 50.8 for MAE), and consistent improvements when real or pseudo-depth is incorporated (Bachmann et al., 2022).

Surgical Vision

In surgical scene understanding, MultiMAE operates over RGB–depth pairs, with explicit geometric tokenization for depth. Pre-training on 1.4M surgical frames with synthetic depth yields consistent gains on downstream detection, segmentation, pose, and depth tasks, with improvements of +20–87% on standard metrics compared to unimodal MAE. Notably, depth is required only for pre-training; at inference, the model can be deployed on RGB images alone (Han et al., 26 Jan 2026).

5. Transfer Protocols and Downstream Adaptation

MultiMAE supports both frozen-encoder (linear probing) and end-to-end fine-tuning transfer. For classification:

Linear probes are trained on top of the (frozen) encoder.
Full fine-tuning updates the encoder and the task-specific head.

For segmentation:

A segmentation network (e.g., ConvNeXt block or UNETR) is attached, potentially using multi-layer skip connections.
For 3D inputs, spatial grid locations with missing modalities are filled by averaging tokens across present modalities.
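
The fill-by-averaging step for 3D inputs can be sketched as follows. This is an assumed reading of the description above: each modality contributes one token per grid location, and a missing modality's grid is replaced by the element-wise mean of the present modalities' tokens at the same locations. The function name and the dictionary layout are hypothetical.

```python
def fill_missing_modalities(tokens_by_modality):
    """Replace missing modality grids with the mean of the present ones.

    tokens_by_modality: dict name -> list of token vectors (one per grid
    location), or None when the modality is missing.
    """
    present = [t for t in tokens_by_modality.values() if t is not None]
    if not present:
        raise ValueError("at least one modality must be present")
    n_locations, dim = len(present[0]), len(present[0][0])
    mean_tokens = [
        [sum(t[loc][d] for t in present) / len(present) for d in range(dim)]
        for loc in range(n_locations)
    ]
    return {name: (tokens if tokens is not None else mean_tokens)
            for name, tokens in tokens_by_modality.items()}


tokens = {
    "t1": [[1.0, 2.0], [3.0, 4.0]],
    "t2": [[3.0, 6.0], [5.0, 8.0]],
    "flair": None,  # missing at inference time
}
filled = fill_missing_modalities(tokens)
print(filled["flair"])  # [[2.0, 4.0], [4.0, 6.0]]
```

With this filling, a segmentation head that expects a token grid for every modality receives a complete input regardless of which sequences were acquired.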

MultiMAE's encoder is agnostic to missing modalities; missing tokens are simply omitted at both pre-training and downstream time. This property contrasts with early-fusion models that require a fixed set of input channels, which are typically filled with zeros for missing modalities and exhibit degraded performance as a result.
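
The contrast between token omission and zero filling can be illustrated with a toy example. Both functions below are hypothetical sketches: the first builds a late-fusion token sequence in which a missing modality contributes no tokens, and the second builds an early-fusion input in which the missing modality's channels are padded with zeros.

```python
def late_fusion_input(tokens_by_modality, modality_order):
    """Concatenate tokens of the available modalities, tagged with their source."""
    sequence = []
    for name in modality_order:
        tokens = tokens_by_modality.get(name)
        if tokens is None:
            continue  # missing modality: contributes no tokens at all
        sequence.extend((name, tok) for tok in tokens)
    return sequence


def early_fusion_input(tokens_by_modality, modality_order, dim):
    """Stack modalities along the channel axis, zero-filling missing ones."""
    present = [t for t in tokens_by_modality.values() if t is not None]
    rows = []
    for loc in range(len(present[0])):
        row = []
        for name in modality_order:
            tokens = tokens_by_modality.get(name)
            row.extend(tokens[loc] if tokens is not None else [0.0] * dim)
        rows.append(row)
    return rows


order = ["t1", "flair"]
partial = {"t1": [[1.0], [2.0]], "flair": None}
print(late_fusion_input(partial, order))          # [('t1', [1.0]), ('t1', [2.0])]
print(early_fusion_input(partial, order, dim=1))  # [[1.0, 0.0], [2.0, 0.0]]
```

In the late-fusion case the sequence simply gets shorter, so the encoder never sees placeholder values; in the early-fusion case the input keeps its fixed width and the zeros are treated as data.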

6. Quantitative Benchmarks and Ablation Insights

Extensive experiments demonstrate that MultiMAE consistently outperforms unimodal or early-fusion baselines:

Domain	Task	SOTA Baseline	MultiMAE Result	Reference
Earth Obs.	Classif. (m-eurosat FT)	DOFA (92.2%)	97.3%	(Sosa et al., 20 May 2025)
Earth Obs.	Segm. (m-cashew, FT)	ConvNeXt (75.9%)	81.99% mIoU	(Sosa et al., 20 May 2025)
Brain MRI	Segm. (Dice, missing)	ViT-MAE (Avg.)	Δ=+10.1	(Erdur et al., 14 Sep 2025)
Surgical Vision	Segm. (EndoVis18, FT)	MAE (23.5)	43.9 mIoU	(Han et al., 26 Jan 2026)
Computer Vision	NYUv2 mIoU (RGB, FT)	MAE (50.8)	52.0	(Bachmann et al., 2022)

Ablation studies in each domain highlight:

Multi-modal pre-training yields benefits retained even when downstream fine-tuning uses only a subset of modalities.
Masking with randomly sampled proportions per modality (Dirichlet allocation) is critical to cross-modal predictive coding.
Data efficiency: MultiMAE fine-tuned with substantially less labeled data matches or exceeds unimodal models trained at full scale (Han et al., 26 Jan 2026).

7. Implications and Future Directions

MultiMAE’s modular encoder–decoder design and masking protocol provide significant flexibility for both research and practical deployment. Key implications include:

Modality Flexibility: The shared encoder accommodates arbitrary subsets of input modalities without retraining or architectural changes, facilitating deployment in diverse data circumstances.
Inference Simplicity: Modalities used only for pre-training (e.g., depth) can be omitted at test time, incurring no inference cost.
Cross-Modal Predictive Coding: The model learns representations that bridge modalities, which is crucial when some modalities are absent or require imputation.
Extension Potential: There is scope to include further modalities (SAR, text, temporal), alternative masking strategies, and to unify contrastive and masked modeling objectives.
Domain-Specific Pre-training: Using domain-matched pseudo-modalities for pre-training improves downstream performance robustness.

A plausible implication is that MultiMAE frameworks represent a scalable approach to robust, generalist foundation models in vision, remote sensing, and medical domains, particularly in environments with heterogeneous and incomplete data streams.

References:

(Bachmann et al., 2022): "MultiMAE: Multi-modal Multi-task Masked Autoencoders"
(Sosa et al., 20 May 2025): "MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks"
(Erdur et al., 14 Sep 2025): "MultiMAE for Brain MRIs: Robustness to Missing Inputs Using Multi-Modal Masked Autoencoder"
(Han et al., 26 Jan 2026): "On the Role of Depth in Surgical Vision Foundation Models: An Empirical Study of RGB-D Pre-training"
