Multi-modal & Multi-task MAE
- Multi-modal and multi-task MAE are unified self-supervised frameworks that tokenize and process heterogeneous data using modality-adaptive projections.
- They employ advanced masking strategies that randomly omit tokens and entire modalities to enforce cross-modal predictive coding and robust reconstruction.
- These frameworks enable flexible transfer to diverse tasks such as segmentation, classification, and depth estimation, demonstrating strong performance across multiple domains.
Multi-modal and multi-task Masked Autoencoders (MAE) extend self-supervised representation learning to heterogeneous data types (modalities) and multiple predictive objectives (tasks) within a unified framework. By integrating architectural innovations, advanced masking schemes, and cross-modality reasoning, these models enable robust pretraining and flexible transfer, including in scenarios with missing modalities, which are prevalent in practical applications such as medical imaging, geospatial analysis, autonomous driving, UAV scene understanding, and point cloud processing.
1. Foundations and Architectural Principles
Multi-modal and multi-task MAE frameworks generalize the vanilla MAE design by incorporating the following architectural elements:
- Input Tokenization and Modality-Adaptive Projections: Each modality (e.g., MRI sequences, RGB images, depth, LiDAR point clouds, SAR, segmentation maps) is tokenized independently, typically into non-overlapping patches or voxels. These are projected into a shared embedding space via modality-specific adapters (e.g., linear projections, as in brain MRI MultiMAE (Erdur et al., 14 Sep 2025); 3D/2D encoders in point cloud models (Liu et al., 23 Jul 2025); ConvNeXt patch embedding for Earth observation (Nedungadi et al., 2024)).
- Shared Encoder with Late Fusion: Unmasked tokens from all modalities are concatenated and processed jointly in a shared encoder, typically based on ViT or convolutional backbones. The “late fusion” paradigm ensures cross-modal interactions are mediated by self-attention rather than by stacking channels at the input (Erdur et al., 14 Sep 2025, Bachmann et al., 2022, Sosa et al., 20 May 2025, Zou et al., 2023).
- Task- and Modality-specific Decoders: Multiple lightweight decoders are attached, one per target (modality or downstream task). Each decoder receives relevant tokens and reconstructs either masked inputs (self-supervised) or produces task outputs, often via a combination of cross-attention and Transformer or convolutional blocks (Bachmann et al., 2022, Erdur et al., 14 Sep 2025, Sosa et al., 20 May 2025, Nedungadi et al., 2024).
- Single-model Multi-task Learning: A common encoder is pretrained to serve many downstream tasks (e.g., classification, segmentation, depth estimation, shape completion), often via multiple task-specific heads attached to the shared backbone (Liu et al., 23 Jul 2025, Nedungadi et al., 2024, Sosa et al., 20 May 2025, Mihai-Cristian et al., 11 Oct 2025).
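The tokenize, project, and concatenate pipeline described above can be sketched as follows. This is a minimal NumPy illustration, not any paper's implementation: the patch size, embedding dimension, and the `rgb`/`depth` modalities are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch=4):
    """Split an (H, W, C) array into flattened non-overlapping patches."""
    H, W, C = img.shape
    ph, pw = H // patch, W // patch
    patches = img.reshape(ph, patch, pw, patch, C).swapaxes(1, 2)
    return patches.reshape(ph * pw, patch * patch * C)

# Two hypothetical modalities with different channel counts.
rgb   = rng.normal(size=(16, 16, 3))   # e.g., an RGB image
depth = rng.normal(size=(16, 16, 1))   # e.g., a depth map

D = 32  # shared embedding dimension (illustrative)
# Modality-specific linear adapters project each modality's patches
# into the shared embedding space.
W_rgb   = rng.normal(size=(4 * 4 * 3, D)) / np.sqrt(4 * 4 * 3)
W_depth = rng.normal(size=(4 * 4 * 1, D)) / np.sqrt(4 * 4 * 1)

tokens_rgb   = patchify(rgb)   @ W_rgb     # (16, 32)
tokens_depth = patchify(depth) @ W_depth   # (16, 32)

# "Late fusion": tokens from all modalities are concatenated along the
# sequence axis and processed jointly by one shared encoder.
sequence = np.concatenate([tokens_rgb, tokens_depth], axis=0)  # (32, 32)
```

In a real model the linear adapters are learned and the concatenated sequence (plus positional and modality embeddings) is fed to a ViT-style encoder; the sketch only shows how differently-shaped inputs end up in one token sequence.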
2. Masking and Cross-Modality Pretraining Strategies
Multi-modal MAE methods are distinguished by sophisticated masking and reconstruction strategies:
- Joint Spatial and Modal Masking: Patches are randomly masked both within and across modalities. The per-modality masking proportion is sampled at every training iteration, e.g., from a Dirichlet or Bernoulli distribution (Bachmann et al., 2022, Erdur et al., 14 Sep 2025, Sosa et al., 20 May 2025, Mihai-Cristian et al., 11 Oct 2025). Extreme masking ratios (up to 75% of all patches across all modalities) enforce reliance on both intra- and cross-modal prediction.
- Entirely Missing Modalities: Probabilistic schemes (e.g., Dirichlet sampling or independent Bernoulli masking per modality) ensure the model occasionally observes samples where one or more modalities are entirely absent. This explicitly trains the model to “hallucinate” or infer missing modalities from those present (Erdur et al., 14 Sep 2025, Bachmann et al., 2022, Mihai-Cristian et al., 11 Oct 2025).
- Multi-task Pretext Losses: Losses are summed (often equally weighted) across all modalities and/or tasks, typically with reconstruction losses (MSE, L1, cross-entropy) applied only to masked positions (Bachmann et al., 2022, Erdur et al., 14 Sep 2025, Sosa et al., 20 May 2025, Nedungadi et al., 2024). Some frameworks add auxiliary losses for contrastive alignment or discriminative objectives (Liu et al., 23 Jul 2025, Mihai-Cristian et al., 11 Oct 2025), but in several settings, pure reconstruction loss suffices.
- Cross-modal Predictive Coding: The shared encoder must capture both spatial and semantic dependencies within modalities and correlations across them, facilitating cross-modality predictive coding (e.g., inferring semantic segmentation from RGB + depth; synthesizing FLAIR from T1/T2 in MRIs; reconstructing point geometry from images) (Bachmann et al., 2022, Erdur et al., 14 Sep 2025, Zou et al., 2023, Liu et al., 23 Jul 2025).
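The joint spatial-and-modal masking scheme can be illustrated with a small sketch. This is a hedged NumPy toy, not any paper's exact procedure: the token counts, the 75% overall masking ratio, and the modality names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens = {"rgb": 196, "depth": 196, "semseg": 196}  # tokens per modality
keep_total = int(0.25 * sum(n_tokens.values()))       # keep 25%, mask 75%

# Sample the per-modality share of *visible* tokens from a Dirichlet.
# A low concentration would make it likely that some modality is almost
# entirely masked, which forces cross-modal prediction.
alpha = np.ones(len(n_tokens))
shares = rng.dirichlet(alpha)

visible = {}
for (name, n), share in zip(n_tokens.items(), shares):
    k = min(n, int(round(share * keep_total)))
    visible[name] = rng.choice(n, size=k, replace=False)  # kept indices

masked = {name: np.setdiff1d(np.arange(n), visible[name])
          for name, n in n_tokens.items()}

# The reconstruction loss would then be computed on masked positions
# only, e.g. an MSE between predictions and targets at `masked[name]`.
total_visible = sum(len(v) for v in visible.values())
```

Because the Dirichlet shares are resampled every iteration, the visible-token budget shifts between modalities from step to step, which is what trains the encoder to cope with arbitrarily imbalanced (or absent) inputs.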
3. Handling Missing Modalities and Flexible Inference
A defining capability of multi-modal MAEs is robustness to missing inputs at inference:
- Natural Omission of Modalities: Any subset of modalities present at inference can be input—unavailable modalities are simply omitted, and the model reconstructs or predicts as trained (Erdur et al., 14 Sep 2025, Bachmann et al., 2022, Sosa et al., 20 May 2025). This contrasts with traditional multi-input models, which often require all modalities and degrade abruptly or require imputation when data are missing.
- Ensemble Predictions via Masking: Probabilistic Hyper-Graph MAE (PHG-MAE) (Mihai-Cristian et al., 11 Oct 2025) extends this by ensembling predictions across multiple random masks at inference, yielding more stable and temporally consistent outputs, particularly in video or sequence settings.
- Graceful Performance Degradation: Empirically, models pretrained with systematic cross-modal masking degrade gracefully as more modalities are withheld, often outperforming single-modal or naively-imputed baselines by substantial margins—even for downstream tasks not explicitly present at pretraining (Erdur et al., 14 Sep 2025, Sosa et al., 20 May 2025, Bachmann et al., 2022).
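The mask-ensembling idea can be sketched in a few lines. The `model` below is a deliberately trivial stand-in (a fixed affine map on visible tokens), and the sequence length, keep count, and number of runs are hypothetical; only the ensembling pattern itself reflects the technique.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(tokens, mask):
    """Stand-in for a pretrained MAE: a fixed affine map applied to
    visible tokens, zeros elsewhere (purely illustrative)."""
    out = np.zeros_like(tokens)
    out[mask] = tokens[mask] * 0.5 + 1.0
    return out

tokens = rng.normal(size=(64, 32))  # one token sequence, D = 32

# Run inference under several independent random masks and average
# the predictions, as in mask-ensembling at test time.
n_runs, keep = 8, 48
preds = []
for _ in range(n_runs):
    idx = rng.choice(64, size=keep, replace=False)
    mask = np.zeros(64, dtype=bool)
    mask[idx] = True
    preds.append(model(tokens, mask))

ensemble = np.mean(preds, axis=0)
```

Averaging over masks smooths out the variance introduced by any single random mask, which is the source of the more stable, temporally consistent outputs reported for video and sequence settings.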
4. Multi-task Transfer and Downstream Adaptation
The pretrained encoder can be flexibly adapted to diverse downstream tasks:
- Classification, Segmentation, Regression: By attaching appropriate task heads (linear, UNet-like, MLP, or Transformer decoders), multi-modal MAEs serve as backbone encoders in settings including image/point classification, semantic/instance segmentation, depth regression, 3D object detection, and BEV segmentation (Erdur et al., 14 Sep 2025, Sosa et al., 20 May 2025, Zou et al., 2023, Liu et al., 23 Jul 2025, Nedungadi et al., 2024).
- Transfer Protocols: Fine-tuning can update the encoder and task head jointly, or keep the encoder frozen and train only the head (linear probing, or frozen encoder + decoder in segmentation). The feature-aggregation scheme depends on the input configuration (e.g., averaging patch tokens across modalities at each spatial location for segmentation (Erdur et al., 14 Sep 2025)).
- Empirical Results:
- Brain MRIs: +10.1 Dice in segmentation and +0.46 MCC in classification compared to MAE-ViT, under missing-modality evaluation (Erdur et al., 14 Sep 2025).
- Earth Observation: MultiMAE achieves 97.3% top-1 accuracy (full fine-tune) on m-eurosat and 70.25 mAP on m-bigearthnet, improving over single-modal or specialist-transfer baselines (Sosa et al., 20 May 2025).
- UAV Scenes (PHG-MAE): Mask-ensemble prediction lifts weighted mIoU to ~55 vs ~33–41 for UNet/Graph baselines, with 98% temporal consistency (Mihai-Cristian et al., 11 Oct 2025).
- Point Clouds (MMPT): Multi-modal, multi-task pretraining achieves 93.9% accuracy on ModelNet40, 86.4% on ScanObjectNN, and boosts 3D shape completion over prior methods (Liu et al., 23 Jul 2025).
- Autonomous Driving (UniMAE): Joint camera–LiDAR pretraining yields +1.2 NDS and +6.5 mIoU in 3D detection and BEV segmentation, respectively, on nuScenes, compared to unimodal or naïve fusion (Zou et al., 2023).
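The feature-aggregation step mentioned under transfer protocols (averaging patch tokens across modalities at each spatial location) amounts to the following NumPy sketch; the 14×14 patch grid, the 768-dimensional features, and the modality names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder output: one token sequence per modality, spatially aligned.
# Shape per modality: (n_patches, D).
feats = {
    "rgb":   rng.normal(size=(196, 768)),
    "depth": rng.normal(size=(196, 768)),
}

# Average tokens across modalities at each spatial location, then
# reshape into a 2D feature map for a segmentation decoder.
stacked = np.stack(list(feats.values()), axis=0)   # (M, 196, 768)
fused = stacked.mean(axis=0)                       # (196, 768)
feature_map = fused.reshape(14, 14, 768)           # 14x14 patch grid
```

Because patch tokens from different modalities occupy the same spatial grid, a simple per-location mean yields one fused map regardless of how many modalities were supplied, which is what lets the same segmentation head run with any subset of inputs.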
5. Theoretical Extensions and Variants
Recent research has explored generalizations and alternative perspectives:
- Probabilistic Hyper-Graph Formulation: PHG-MAE (Mihai-Cristian et al., 11 Oct 2025) integrates random modality masking with probabilistic hyper-graph theory, where each masking pattern corresponds to sampling a hyper-edge. This reveals an ensemble expectation interpretation and provides a principled foundation for uncertainty quantification and knowledge distillation into compact student models (retaining high mIoU with <1M parameters).
- Multi-Pretext and Cross-Modal Proxy Tasks: Models like MP-MAE (Nedungadi et al., 2024) simultaneously reconstruct diverse pixel-level and image-level modalities (e.g., land cover, DEM, biome label, time, geolocation) in Earth observation. Task-uncertainty weighting is sometimes applied to balance learning across high- and low-noise modalities.
- 3D Fusion and Volumetric Interaction: UniMAE (Zou et al., 2023) fuses heterogeneous modalities into a unified 3D latent volume (using camera-to-voxel and LiDAR-to-voxel projections) and applies deformable 3D self-attention, preserving spatial alignment and cross-modal context with minimal information loss.
- Contrastive and Generative Multi-task Learning: Some frameworks combine discriminative (contrastive/invariance-based) and generative (reconstruction-based) signals, particularly in 3D tasks; e.g., MMPT uses token-level, point-level, and cross-modal contrastive losses with task-specific weighting schemes for optimal transfer (Liu et al., 23 Jul 2025).
6. Limitations and Remaining Challenges
Despite significant progress, open problems remain:
- Blurry Reconstructions and Perceptual Losses: Training with pure MSE can result in blurred predictions, notably at fine boundaries (e.g., tumor edges in MRIs (Erdur et al., 14 Sep 2025)). Incorporating perceptual or adversarial losses is suggested to sharpen outputs.
- Modality Number and Diversity: Scaling to more input modalities (text, audio, LiDAR, radar), or non-image domains, entails architectural and data harmonization challenges (Bachmann et al., 2022, Erdur et al., 14 Sep 2025, Liu et al., 23 Jul 2025).
- Efficient Fine-tuning and Domain Shift: Pretrained models may require adaptation for distributional shifts (e.g., medical imaging scanners, geographies in EO), necessitating careful fine-tuning or domain adaptation (Erdur et al., 14 Sep 2025).
- Computational Cost and Model Compactness: Large ensembles and wide architectures provide accuracy but may be impractical in real-time or resource-limited settings. Student distillation and architectural compression are active directions (Mihai-Cristian et al., 11 Oct 2025).
7. Impact, Applications, and Outlook
Multi-modal and multi-task MAEs have demonstrated impact across diverse domains:
| Domain | Example Modalities | Downstream Tasks | Key Papers |
|---|---|---|---|
| Medical Imaging | T1, T1c, T2, FLAIR MRIs | Tumor segmentation, subtype classification | (Erdur et al., 14 Sep 2025) |
| Earth Observation | Sentinel-2, Sentinel-1, DEM, canopy, landcover | Classification, segmentation | (Sosa et al., 20 May 2025, Nedungadi et al., 2024) |
| Autonomous Driving | Multi-view camera, LiDAR | 3D object detection, BEV segmentation | (Zou et al., 2023) |
| UAV Scene Understanding | RGB, vegetation/water/sky/segmentation/depth | Semantic segmentation, geometry | (Mihai-Cristian et al., 11 Oct 2025) |
| 3D Point Cloud Analysis | 3D points, rendered 2D images | Classification, part segmentation, detection, completion | (Liu et al., 23 Jul 2025) |
The modality-agnostic, task-agnostic design and flexible masking-based learning of these models facilitate strong transfer, robust missing-data handling, and efficient adaptation to novel tasks and incomplete observations. Continued advances in cross-modal integration, more sophisticated task loss balancing, and scaling strategies are likely to further enhance the generality and utility of multi-modal, multi-task MAE frameworks in both research and applied settings.
References:
- MultiMAE: Multi-modal Multi-task Masked Autoencoders (Bachmann et al., 2022)
- MultiMAE for Brain MRIs: Robustness to Missing Inputs Using Multi-Modal Masked Autoencoder (Erdur et al., 14 Sep 2025)
- MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks (Sosa et al., 20 May 2025)
- MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning (Nedungadi et al., 2024)
- Probabilistic Hyper-Graphs using Multiple Randomly Masked Autoencoders for Semi-supervised Multi-modal Multi-task Learning (Mihai-Cristian et al., 11 Oct 2025)
- Multi-modal Multi-task Pre-training for Improved Point Cloud Understanding (Liu et al., 23 Jul 2025)
- UniMAE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving (Zou et al., 2023)