
4M: Massively Multimodal Masked Modeling (2312.06647v1)

Published 11 Dec 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent LLMs exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision. In this paper, we take a step in this direction and propose a multimodal training scheme called 4M. It consists of training a single unified Transformer encoder-decoder using a masked modeling objective across a wide range of input/output modalities - including text, images, geometric, and semantic modalities, as well as neural network feature maps. 4M achieves scalability by unifying the representation space of all modalities through mapping them into discrete tokens and performing multimodal masked modeling on a small randomized subset of tokens. 4M leads to models that exhibit several key capabilities: (1) they can perform a diverse set of vision tasks out of the box, (2) they excel when fine-tuned for unseen downstream tasks or new input modalities, and (3) they can function as a generative model that can be conditioned on arbitrary modalities, enabling a wide variety of expressive multimodal editing capabilities with remarkable flexibility. Through experimental analyses, we demonstrate the potential of 4M for training versatile and scalable foundation models for vision tasks, setting the stage for further exploration in multimodal learning for vision and other domains.

Overview of "4M: Massively Multimodal Masked Modeling"

The paper "4M: Massively Multimodal Masked Modeling" introduces a novel approach to develop versatile computer vision models capable of handling multiple modalities and tasks. Unlike traditional vision models that are highly specialized, the authors aim to mimic the scalability and multitask capabilities of recent LLMs by proposing a training framework that unifies various input and output modalities.

Methodology

The 4M framework uses a single unified Transformer encoder-decoder, trained with a multimodal masked modeling objective. The approach is designed to scale across several modalities, including text, images, geometry, semantics, and neural network feature maps. The key innovation is representing all of these diverse modalities as sequences of discrete tokens, which lets the model learn shared representations and predict any modality from any other.
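To make the unified token space concrete, here is a minimal sketch of the idea. Everything in it is an illustrative stand-in, not the authors' implementation: `ToyImageTokenizer` uses a random codebook where 4M uses learned VQ-VAE tokenizers for image-like modalities (and a WordPiece vocabulary for text), and the per-modality id offsets are a simplification of how modality-specific vocabularies can be packed into one.

```python
import numpy as np

class ToyImageTokenizer:
    """Stand-in for a learned VQ-VAE tokenizer: quantizes 16x16 image
    patches to the nearest entry of a (here: random) codebook."""
    def __init__(self, codebook_size=1024, patch=16, seed=0):
        rng = np.random.default_rng(seed)
        self.codebook = rng.normal(size=(codebook_size, patch * patch * 3))
        self.patch = patch

    def encode(self, img):
        """img: (H, W, 3) float array -> (num_patches,) integer token ids."""
        p = self.patch
        h, w = img.shape[0] // p, img.shape[1] // p
        patches = img[:h * p, :w * p].reshape(h, p, w, p, 3).transpose(0, 2, 1, 3, 4)
        flat = patches.reshape(h * w, -1)                  # (h*w, p*p*3)
        # Squared-distance nearest-neighbour lookup against the codebook.
        d = ((flat ** 2).sum(1)[:, None]
             - 2 * flat @ self.codebook.T
             + (self.codebook ** 2).sum(1)[None, :])
        return d.argmin(axis=1)

# Each modality gets its own id range so all tokens share one vocabulary.
rgb = ToyImageTokenizer(seed=0).encode(np.random.rand(224, 224, 3))
depth = ToyImageTokenizer(seed=1).encode(np.random.rand(224, 224, 1).repeat(3, axis=-1))
offsets = {"rgb": 0, "depth": 1024}
unified = np.concatenate([rgb + offsets["rgb"], depth + offsets["depth"]])
print(unified.shape)  # one flat sequence of discrete tokens spanning both modalities
```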

During training, a random subset of tokens serves as the input and a second, disjoint subset as the prediction target, so the per-step cost stays bounded as the number of modalities grows (see the sketch below). This training scheme yields models that support a wide array of vision tasks out of the box, transfer effectively to unseen tasks and modalities, and act as generative models capable of multimodal conditional generation, enabling expressive editing capabilities.
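The sampling step itself fits in a few lines; the budgets `n_in` and `n_tgt` below are illustrative placeholders, not the paper's hyperparameters:

```python
import numpy as np

def sample_input_target(seq_len, n_in=128, n_tgt=128, rng=None):
    """Pick a random input subset and a disjoint target subset of token positions."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(seq_len)
    input_idx = perm[:n_in]                 # tokens the encoder actually sees
    target_idx = perm[n_in:n_in + n_tgt]    # tokens the decoder must predict
    return input_idx, target_idx

# Per-step cost depends only on n_in + n_tgt, not on the full multimodal
# sequence length, which is what keeps training tractable as modalities grow.
inp, tgt = sample_input_target(seq_len=196 * 5)  # e.g. five image-like modalities
```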

Experimental Results

4M models demonstrate strong generalization and adaptability, outperforming several baselines on benchmarks such as COCO detection, ADE20K segmentation, and NYUv2 depth estimation. Notably, while the model excels at multimodal tasks, specialized models such as DeiT III still edge it out slightly on single-modality benchmarks like ImageNet-1K classification. The paper also highlights the model's generative capabilities, showing that multimodal editing and semantic generation can be steered with high configurability through chaining and guidance techniques (sketched below).
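As a rough illustration of how chaining and guidance compose, here is a hypothetical sketch. `model`, `init_masked`, and `unmask_most_confident` are invented stand-ins for a trained 4M encoder-decoder and a MaskGIT-style iterative decoder; only the guidance line, which applies the standard classifier-free guidance formula to token logits, is a known technique used as described.

```python
def generate_chain(model, cond_tokens, modalities, steps=8, guidance=2.0):
    """Chained conditional generation: each finished modality becomes
    conditioning for the next one (e.g. caption -> segmentation -> RGB)."""
    generated = dict(cond_tokens)            # start from the given conditioning
    for mod in modalities:
        tokens = model.init_masked(mod)      # fully masked target sequence
        for _ in range(steps):
            logits_c = model(generated, tokens, mod)   # conditional logits
            logits_u = model({}, tokens, mod)          # unconditional logits
            # Classifier-free guidance: push logits away from the unconditional ones.
            logits = logits_u + guidance * (logits_c - logits_u)
            tokens = model.unmask_most_confident(tokens, logits)
        generated[mod] = tokens              # chaining: condition the next modality
    return generated
```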

Future Implications

The 4M framework carries significant theoretical and practical implications. Theoretically, it helps bridge the gap between vision models and LLMs, suggesting a pathway toward foundation models that learn multimodal representations efficiently. Practically, it opens avenues for developing more general AI systems that are not confined to specific data types or tasks.

Future work could expand on 4M by incorporating additional modalities, improving tokenizer quality, and using larger, more diverse datasets. As AI systems continue to evolve, comprehensive unified models like 4M will be critical for overcoming today's siloed approaches, ultimately leading to more capable and versatile AI solutions.

References (133)
  1. CM3: A causal masked multimodal model of the internet. arXiv:2201.07520, 2022.
  2. Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, 2023.
  3. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022.
  4. SiT: Self-supervised vision transformer. arXiv:2104.03602, 2021.
  5. MultiMAE: Multi-modal multi-task masked autoencoders. In European Conference on Computer Vision, 2022.
  6. Data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning, 2022.
  7. BEiT: BERT pre-training of image transformers. In International Conference on Learning Representations, 2022.
  8. Visual prompting via image inpainting. In Advances in Neural Information Processing Systems, 2022.
  9. MultiDiffusion: Fusing diffusion paths for controlled image generation. In International Conference on Machine Learning, 2023.
  10. Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 2000.
  11. MulT: An end-to-end multitask learning transformer. In Conference on Computer Vision and Pattern Recognition, 2022.
  12. Language models are few-shot learners. arXiv:2005.14165, 2020.
  13. Bfloat16 processing for neural networks. In Symposium on Computer Arithmetic, 2019.
  14. COYO-700M: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
  15. Cascade R-CNN: Delving into high quality object detection. In Conference on Computer Vision and Pattern Recognition, 2018.
  16. Rich Caruana. Multitask learning. Machine Learning, 1997.
  17. MaskGIT: Masked generative image transformer. In Conference on Computer Vision and Pattern Recognition, 2022.
  18. Muse: Text-to-image generation via masked generative transformers. In International Conference on Machine Learning, 2023.
  19. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Conference on Computer Vision and Pattern Recognition, 2021.
  20. Generative pretraining from pixels. In International Conference on Machine Learning, 2020a.
  21. Pix2seq: A language modeling framework for object detection. In International Conference on Learning Representations, 2022a.
  22. A unified sequence interface for vision tasks. In Advances in Neural Information Processing Systems, 2022b.
  23. OASIS: A large-scale dataset for single image 3D in the wild. In Conference on Computer Vision and Pattern Recognition, 2020b.
  24. Masked-attention mask transformer for universal image segmentation. In Conference on Computer Vision and Pattern Recognition, 2022.
  25. PaLM: Scaling language modeling with pathways. arXiv:2204.02311, 2022.
  26. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, 2020.
  27. Randaugment: Practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems, 2020.
  28. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, 2023.
  29. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, 2009.
  30. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019.
  31. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  32. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3D scans. In International Conference on Computer Vision, 2021.
  33. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In International Conference on Computer Vision, 2015.
  34. Are large-scale datasets necessary for self-supervised pre-training? arXiv:2112.10740, 2021.
  35. Taming transformers for high-resolution image synthesis. In Conference on Computer Vision and Pattern Recognition, 2021.
  36. EVA-02: A visual representation for neon genesis. arXiv:2303.11331, 2023a.
  37. EVA: Exploring the limits of masked visual representation learning at scale. In Conference on Computer Vision and Pattern Recognition, 2023b.
  38. Masked autoencoders as spatiotemporal learners. In Advances in Neural Information Processing Systems, 2022.
  39. DataComp: In search of the next generation of multimodal datasets. arXiv:2304.14108, 2023.
  40. Make-A-Scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, 2022.
  41. Simple copy-paste is a strong data augmentation method for instance segmentation. In Conference on Computer Vision and Pattern Recognition, 2021a.
  42. Multi-task self-training for learning general representations. In International Conference on Computer Vision, 2021b.
  43. ImageBind: One embedding space to bind them all. In Conference on Computer Vision and Pattern Recognition, 2023a.
  44. OmniMAE: Single model masked pretraining on images and videos. In Conference on Computer Vision and Pattern Recognition, 2023b.
  45. Google. PaLM 2 technical report. https://ai.google/static/documents/palm2techreport.pdf, 2023.
  46. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
  47. Mask R-CNN. Transactions on Pattern Analysis and Machine Intelligence, 2017.
  48. Masked autoencoders are scalable vision learners. In Conference on Computer Vision and Pattern Recognition, 2022.
  49. Gaussian error linear units (GELUs). arXiv:1606.08415, 2016.
  50. Classifier-free diffusion guidance. arXiv:2207.12598, 2022.
  51. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.
  52. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 2022.
  53. Autoregressive diffusion models. In International Conference on Learning Representations, 2022.
  54. UniT: Multimodal multitask learning with a unified transformer. In International Conference on Computer Vision, 2021.
  55. Deep networks with stochastic depth. In European Conference on Computer Vision, 2016.
  56. Masked autoencoders that listen. In Advances in Neural Information Processing Systems, 2022a.
  57. Language is not all you need: Aligning perception with language models. arXiv:2302.14045, 2023.
  58. Multimodal conditional image synthesis with product-of-experts gans. In European Conference on Computer Vision, 2022b.
  59. Perceiver IO: A general architecture for structured inputs & outputs. In International Conference on Learning Representations, 2022.
  60. Scaling laws for neural language models. arXiv:2001.08361, 2020.
  61. 3D common corruptions and data augmentation. In Conference on Computer Vision and Pattern Recognition, 2022.
  62. X&Fuse: Fusing visual information in text-to-image generation. arXiv:2303.01000, 2023.
  63. Iasonas Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Conference on Computer Vision and Pattern Recognition, 2017.
  64. UViM: A unified modeling approach for vision with learned guiding codes. In Advances in Neural Information Processing Systems, 2022.
  65. MAGE: Masked generative encoder to unify representation learning and image synthesis. In Conference on Computer Vision and Pattern Recognition, 2023.
  66. Exploring plain vision transformer backbones for object detection. In European Conference on Computer Vision, 2022.
  67. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
  68. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, 2022a.
  69. Prismer: A vision-language model with an ensemble of experts. arXiv:2303.02506, 2023.
  70. Exploring target representations for masked autoencoders. arXiv:2209.03917, 2022b.
  71. Swin transformer: Hierarchical vision transformer using shifted windows. In International Conference on Computer Vision, 2021.
  72. A convnet for the 2020s. In Conference on Computer Vision and Pattern Recognition, 2022c.
  73. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
  74. Unified-IO: A unified model for vision, language, and multi-modal tasks. In International Conference on Learning Representations, 2023.
  75. Improving diffusion model efficiency through patching. arXiv:2207.04316, 2022.
  76. Attention bottlenecks for multimodal fusion. In Advances in Neural Information Processing Systems, 2021.
  77. Transframer: Arbitrary frame prediction with generative models. arXiv:2203.09494, 2022.
  78. Negative prompt. https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Negative-prompt, 2022.
  79. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, 2022.
  80. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, 2021.
  81. OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023.
  82. BEiT v2: Masked image modeling with vector-quantized visual tokenizers. arXiv:2208.06366, 2022.
  83. Tuning computer vision models with task rewards. In International Conference on Machine Learning, 2023.
  84. Language models are unsupervised multitask learners. OpenAI Technical Report, 2019.
  85. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
  86. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020.
  87. ZeRO: Memory optimizations toward training trillion parameter models. In International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.
  88. Zero-shot text-to-image generation. In International Conference on Machine Learning, 2021.
  89. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.
  90. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. Transactions on Pattern Analysis and Machine Intelligence, 2020.
  91. Vision transformers for dense prediction. In International Conference on Computer Vision, 2021.
  92. A generalist agent. Transactions on Machine Learning Research, 2022.
  93. ImageNet-21K pretraining for the masses. arXiv:2104.10972, 2021.
  94. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision, 2021.
  95. High-resolution image synthesis with latent diffusion models. In Conference on Computer Vision and Pattern Recognition, 2022.
  96. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2014.
  97. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022.
  98. Mid-level visual representations improve generalization and sample efficiency for learning visuomotor policies. arXiv:1812.11971, 2018.
  99. LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems, 2022.
  100. Noam Shazeer. GLU variants improve transformer. arXiv:2002.05202, 2020.
  101. DiVAE: Photorealistic images synthesis with denoising diffusion decoder. arXiv:2206.00386, 2022.
  102. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, 2012.
  103. FLAVA: A foundational language and vision alignment model. In Conference on Computer Vision and Pattern Recognition, 2022.
  104. The development of embodied cognition: Six lessons from babies. Artificial Life, 2005.
  105. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
  106. UL2: Unifying language learning paradigms. In International Conference on Learning Representations, 2023.
  107. Rethinking few-shot image classification: a good embedding is all you need? In European Conference on Computer Vision, 2020.
  108. DeiT III: Revenge of the ViT. In European Conference on Computer Vision, 2022.
  109. On the theory of transfer learning: The importance of task diversity. In Advances in Neural Information Processing Systems, 2020.
  110. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 2017.
  111. DIODE: A dense indoor and outdoor depth dataset. arXiv:1908.00463, 2019.
  112. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
  113. Sketch-guided text-to-image diffusion models. In ACM SIGGRAPH Conference, 2023.
  114. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv:2208.10442, 2022.
  115. Images speak in images: A generalist painter for in-context visual learning. In Conference on Computer Vision and Pattern Recognition, 2023.
  116. WebDataset. https://github.com/webdataset/webdataset, 2022.
  117. Masked feature prediction for self-supervised visual pre-training. In Conference on Computer Vision and Pattern Recognition, 2022.
  118. SimMIM: A simple framework for masked image modeling. In Conference on Computer Vision and Pattern Recognition, 2022.
  119. ReCo: Region-controlled text-to-image generation. In Conference on Computer Vision and Pattern Recognition, 2023.
  120. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, 2019.
  121. Vector-quantized image modeling with improved VQGAN. In International Conference on Learning Representations, 2022a.
  122. CoCa: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022b.
  123. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research, 2022c.
  124. Cutmix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision, 2019.
  125. Robust learning through cross-task consistency. In Conference on Computer Vision and Pattern Recognition, 2020.
  126. Taskonomy: Disentangling task transfer learning. In Conference on Computer Vision and Pattern Recognition, 2018.
  127. SoundStream: An end-to-end neural audio codec. Transactions on Audio, Speech, and Language Processing, 2022.
  128. Multimodal image synthesis and editing: A survey. arXiv:2112.13592, 2021.
  129. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
  130. Adding conditional control to text-to-image diffusion models. In International Conference on Computer Vision, 2023.
  131. Scene parsing through ADE20K dataset. In Conference on Computer Vision and Pattern Recognition, 2017.
  132. iBOT: Image BERT pre-training with online tokenizer. In International Conference on Learning Representations, 2022.
  133. Uni-Perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks. In Conference on Computer Vision and Pattern Recognition, 2022.
Authors (7)
  1. David Mizrahi
  2. Roman Bachmann
  3. Oğuzhan Fatih Kar
  4. Teresa Yeo
  5. Mingfei Gao
  6. Afshin Dehghan
  7. Amir Zamir