Understanding MultiMAE: Multi-modal Multi-task Masked Autoencoders
The paper "MultiMAE: Multi-modal Multi-task Masked Autoencoders" presents an innovative pre-training framework designed to enhance the performance of Vision Transformers (ViTs) by employing a multi-modal, multi-task approach. This novel method introduces two primary advancements over the conventional Masked Autoencoders (MAEs): the integration of multiple input data modalities and the incorporation of multiple prediction outputs. The resulting framework claims to offer a versatile, efficient means to improve model transferability to various downstream tasks, showcasing its functionality across several challenging datasets and tasks.
Core Contributions of MultiMAE
- Multi-modal Input Capability: MultiMAE extends beyond standard RGB inputs by optionally accepting additional modalities such as depth maps and semantic segmentation. Each modality is patchified and projected into tokens for a shared Transformer encoder, which is valuable in settings where multiple sources of information are available and can be harnessed to learn richer feature representations (see the sketch after this list).
- Multi-task Output Prediction: The model reconstructs masked patches for every pre-training modality, not only RGB, which encourages more generalized, transferable representations and addresses the single-modality limitation of standard MAEs.
- Efficient Pre-training: MultiMAE masks out a large fraction of patches and spreads the visible patches across modalities, so the encoder processes only a small number of tokens. This keeps training tractable and, because reconstructing one modality requires information from the others, encourages cross-modal predictive coding (the token sampling is shown in the sketch after this list).
- Pseudo-labeling for Flexibility: MultiMAE employs pseudo-labeling, allowing it to be pre-trained on any RGB dataset without requiring aligned multi-modal annotations. Off-the-shelf networks generate the depth and semantic segmentation labels, which broadens the applicability of the approach (a second sketch below outlines this step).
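The following PyTorch sketch illustrates the first and third points above: each modality is patchified with its own linear projection, tagged with a learned modality embedding, and only a randomly sampled, cross-modal subset of tokens is passed to a shared encoder, with per-modality token budgets drawn from a symmetric Dirichlet distribution as described in the paper. The class name, channel counts, dimensions, and the simplified encoder are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiModalMaskedEncoder(nn.Module):
    """Illustrative sketch of MultiMAE-style input handling: patchify each
    modality, keep a random cross-modal subset of tokens, and run only the
    visible tokens through a shared Transformer encoder."""

    def __init__(self, in_channels=None, patch_size=16, dim=768,
                 num_layers=12, num_heads=12):
        super().__init__()
        # Assumed channel counts per modality (semantic maps pre-embedded to 64 channels).
        in_channels = in_channels or {"rgb": 3, "depth": 1, "semseg": 64}
        # One linear patch projection per modality (a conv with stride = patch size).
        self.patch_embed = nn.ModuleDict({
            name: nn.Conv2d(c, dim, kernel_size=patch_size, stride=patch_size)
            for name, c in in_channels.items()
        })
        # Learned modality embeddings let the encoder tell token streams apart.
        # (2D positional embeddings are omitted for brevity.)
        self.modality_embed = nn.ParameterDict({
            name: nn.Parameter(torch.zeros(1, 1, dim)) for name in in_channels
        })
        layer = nn.TransformerEncoderLayer(dim, num_heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, inputs, num_visible=98, alpha=1.0):
        """inputs: dict of (B, C, H, W) tensors, one entry per modality."""
        tokens = {}
        for name, x in inputs.items():
            t = self.patch_embed[name](x).flatten(2).transpose(1, 2)  # (B, N, dim)
            tokens[name] = t + self.modality_embed[name]
        if num_visible is None:  # transfer time: no masking, encode every token
            return self.encoder(torch.cat(list(tokens.values()), dim=1))
        # Sample each modality's share of the visible-token budget (symmetric Dirichlet).
        shares = torch.distributions.Dirichlet(
            torch.full((len(tokens),), float(alpha))).sample()
        visible = []
        for share, t in zip(shares, tokens.values()):
            k = min(int(round(share.item() * num_visible)), t.shape[1])
            idx = torch.randperm(t.shape[1])[:k]  # one shared mask per batch, for brevity
            visible.append(t[:, idx])
        return self.encoder(torch.cat(visible, dim=1))
```

During pre-training, the decoder side (omitted here) would reconstruct the masked patches of every modality from this small set of visible tokens; keeping the encoder's input short is what makes the multi-modal setup tractable.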
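Pseudo-labeling itself is conceptually simple: run off-the-shelf networks over an RGB dataset once and store their predictions as additional training targets. The sketch below uses MiDaS (via torch.hub) for depth and a torchvision DeepLabV3 model for segmentation purely as stand-ins; the specific label networks and preprocessing used in the paper differ, so treat these model choices as assumptions.

```python
import torch
from torchvision.models.segmentation import (
    deeplabv3_resnet50, DeepLabV3_ResNet50_Weights)

# Stand-in pseudo-labelers (illustrative choices, not the networks used in the paper).
depth_model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").eval()
seg_weights = DeepLabV3_ResNet50_Weights.DEFAULT
seg_model = deeplabv3_resnet50(weights=seg_weights).eval()
seg_preprocess = seg_weights.transforms()


@torch.no_grad()
def pseudo_label(rgb):
    """rgb: (B, 3, H, W) float tensor in [0, 1], spatial size divisible by 32.
    Returns a relative depth map and per-pixel class indices as pseudo labels.
    (In practice each labeler's own preprocessing and resizing would be applied.)"""
    depth = depth_model(rgb)                            # (B, h, w) relative depth
    seg_logits = seg_model(seg_preprocess(rgb))["out"]  # (B, num_classes, h, w)
    semseg = seg_logits.argmax(dim=1)                   # (B, h, w) class ids
    return depth, semseg
```

These pseudo labels only need to be computed once per dataset and can then be treated like ground-truth depth and segmentation maps during pre-training.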
Empirical Validation and Results
The framework's effectiveness was validated empirically on several datasets, including ImageNet, ADE20K, Taskonomy, Hypersim, and NYUv2, covering image classification, semantic segmentation, and depth estimation. The results highlight strong cross-modal integration and performance that is competitive with, or better than, traditional MAEs and other baseline models.
- ImageNet Classification: MultiMAE remains competitive with a standard MAE on ImageNet classification, showing that multi-modal pre-training adds functionality without sacrificing RGB-only performance.
- Semantic Segmentation and Depth Estimation: Adding depth and semantic segmentation inputs significantly boosted performance on related tasks, particularly when ground-truth or pseudo-labeled depth maps were available.
- Flexibility in Input Modalities: The design allows different input modalities to be used at transfer time. For instance, when sensor depth is available, MultiMAE can outperform baseline models by leveraging it to improve prediction accuracy (a rough fine-tuning sketch follows this list).
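As a rough illustration of this flexibility, the sketch below fine-tunes the hypothetical MultiModalMaskedEncoder from the earlier sketch with whatever modalities the downstream dataset provides: passing {"rgb": ...} alone or {"rgb": ..., "depth": ...} requires no architectural change. The head, pooling, and hyperparameters are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

# Assumes the MultiModalMaskedEncoder sketch above is in scope and that
# pre-trained weights would be loaded into it before fine-tuning.
encoder = MultiModalMaskedEncoder()
head = torch.nn.Linear(768, 40)  # placeholder 40-class head for a downstream task
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-4)


def train_step(inputs, labels):
    """inputs may be {"rgb": ...} only, or {"rgb": ..., "depth": ...}."""
    tokens = encoder(inputs, num_visible=None)  # no masking at transfer time
    logits = head(tokens.mean(dim=1))           # global average pool over tokens
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```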
Future Directions Indicated by the Research
The paper suggests several directions for future work: extending to more diverse modalities such as video and text, pre-training on larger and more varied datasets, investigating probabilistic approaches to handle output ambiguity more effectively, and modifying the masking strategy to learn even more robust feature representations.
In summary, MultiMAE is a promising direction for multi-modal, multi-task learning with Vision Transformers, offering notable gains in flexibility and applicability across vision tasks. While there is room for further improvement, the current results demonstrate its potential to reshape pre-training strategies in computer vision.