Multimodal Masked Autoencoders: A Framework for Transferable Representation Learning
The paper by Geng et al. proposes the Multimodal Masked Autoencoder (M3AE), a framework for learning unified visual and linguistic representations. It targets a limitation of prevailing multimodal models, which rely predominantly on contrastive learning and therefore require paired image-text data and a separate encoder per modality. M3AE sidesteps these constraints by training a single shared encoder over both modalities with a masked-token-prediction objective.
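To make the "unified encoder" idea concrete, the sketch below (hypothetical names; not the authors' code) embeds image patches and text tokens into a common dimension and concatenates them into one sequence, so a single Transformer can process both modalities uniformly. The vocabulary size, patch size, and maximum caption length are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnifiedEmbedder(nn.Module):
    """Hypothetical sketch: map an image and its caption into one token sequence."""

    def __init__(self, vocab_size=30522, embed_dim=768, patch_size=16, img_size=224):
        super().__init__()
        # ViT-style patch embedding: each 16x16 patch becomes one "visual token".
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Standard lookup table for text tokens.
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        num_patches = (img_size // patch_size) ** 2
        # Separate learnable positional embeddings per modality (an assumption of this sketch).
        self.img_pos = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.txt_pos = nn.Parameter(torch.zeros(1, 64, embed_dim))  # max caption length 64

    def forward(self, images, text_ids):
        # images: (B, 3, 224, 224) -> (B, 196, embed_dim)
        img_tokens = self.patch_embed(images).flatten(2).transpose(1, 2) + self.img_pos
        # text_ids: (B, L) -> (B, L, embed_dim)
        txt_tokens = self.token_embed(text_ids) + self.txt_pos[:, : text_ids.size(1)]
        # One combined sequence; a shared encoder sees both modalities uniformly.
        return torch.cat([img_tokens, txt_tokens], dim=1)
```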
Architecture and Methodology
M3AE uses a simple, scalable architecture in which image patches and text tokens are treated uniformly as a single sequence. The model masks a random subset of this sequence and passes only the visible elements through a shared encoder; a lightweight decoder then reconstructs the masked inputs, without modality-specific encoders or contrastive objectives. The empirical investigation shows that this approach is flexible and scalable: because no alignment signal is required during training, it can consume mixed datasets containing both paired and unpaired data, an advantage over contrastive methods.
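The core pre-training step can be sketched as follows, a minimal illustration assuming the unified sequence from the embedder above and a standard Transformer encoder (mask ratio, model sizes, and names are assumptions, not the authors' exact implementation): a random subset of the combined sequence is dropped, only the visible elements are encoded, and a small decoder would reconstruct the full sequence, regressing pixels for masked patches and predicting token ids for masked text.

```python
import torch
import torch.nn as nn

def random_masking(x, mask_ratio):
    """Keep a random subset of tokens; return kept tokens and their indices (MAE-style)."""
    B, N, D = x.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)   # one uniform score per token
    ids_shuffle = noise.argsort(dim=1)          # random permutation of positions
    ids_keep = ids_shuffle[:, :num_keep]        # indices of the visible tokens
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return x_visible, ids_keep

# Hypothetical shared encoder: one Transformer over both modalities.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)

# tokens: (B, N, 768) unified image+text sequence, e.g. from UnifiedEmbedder above.
tokens = torch.randn(2, 196 + 64, 768)
visible, ids_keep = random_masking(tokens, mask_ratio=0.75)
latent = encoder(visible)   # only the visible tokens are encoded
# A small decoder (omitted here) would re-insert learned [MASK] tokens at the dropped
# positions and predict raw pixels for image patches and token ids for text tokens.
```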
Empirical Evaluation
M3AE was pre-trained on the Conceptual 12M dataset (CC12M) and shown to generalize to downstream tasks better than existing models such as MAE and CLIP. In particular, linear classification (linear probing) on ImageNet shows consistent gains. Notably, the model benefits from a much higher text masking ratio (50-90%) than the roughly 15% standard in masked language models such as BERT, suggesting that aggressive text masking forces the model to draw on the image when reconstructing text and thereby encourages a more integrated cross-modal representation.
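The linear-probing protocol referenced here can be summarized by the following sketch (hypothetical code, not the authors' exact evaluation pipeline; `embed_fn` and `classifier` are assumed helpers): a single linear layer is trained on averaged frozen encoder features, so the score reflects the quality of the pre-trained representation rather than fine-tuning capacity.

```python
import torch
import torch.nn as nn

def linear_probe_step(encoder, classifier, optimizer, images, labels, embed_fn):
    """One training step of a linear probe on frozen encoder features (sketch)."""
    encoder.eval()
    with torch.no_grad():                         # the pre-trained encoder stays frozen
        tokens = embed_fn(images)                 # (B, N, D) image-patch embeddings
        features = encoder(tokens).mean(dim=1)    # global average pooling over tokens
    logits = classifier(features)                 # single trainable linear layer
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# classifier = nn.Linear(768, 1000)  # e.g. ImageNet's 1000 classes
# optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
```

Under this protocol, the reported effect of the text masking ratio corresponds to sweeping that single pre-training hyperparameter and re-running the same probe.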
Results and Implications
Quantitatively, M3AE pre-training yields significant gains over baseline models on image classification and out-of-distribution detection tasks. These results underscore M3AE's ability to exploit large-scale multimodal datasets flexibly and at scale, opening possibilities for domains that depend on coordinating visual and textual data, such as visual reasoning and dialog systems.
Theoretical Contributions and Future Directions
Theoretically, the paper prompts a rethinking of multimodal representation learning. By removing the dependence on paired data and adopting a much higher token masking ratio, the authors challenge existing paradigms and point toward architectures that fuse information from diverse modalities without specialized encoders. Future research could extend M3AE to new applications and improve its training efficiency, a relevant concern given the growing energy cost of large-scale model training.
Conclusion
In conclusion, M3AE is a step toward more efficient and adaptable multimodal representation learning, offering practical advantages that align with contemporary challenges in AI. Future work could focus on optimizing these models for diverse real-world applications. The paradigm presented by Geng et al. offers a promising route to unifying data modalities and may influence future systems that require tight integration of visual and textual information.