CM3: A Causal Masked Multimodal Model of the Internet (2201.07520v1)

Published 19 Jan 2022 in cs.CL

Abstract: We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of their original positions. The causal masking objective provides a type of hybrid of the more common causal and masked language models, by enabling full generative modeling while also providing bidirectional context when generating the masked spans. We train causally masked language-image models on large-scale web and Wikipedia articles, where each document contains all of the text, hypertext markup, hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they appear in the original HTML source (before masking). The resulting CM3 models can generate rich structured, multi-modal outputs while conditioning on arbitrary masked document contexts, and thereby implicitly learn a wide range of text, image, and cross-modal tasks. They can be prompted to recover, in a zero-shot fashion, the functionality of models such as DALL-E, GENRE, and HTLM. We set the new state-of-the-art in zero-shot summarization, entity linking, and entity disambiguation while maintaining competitive performance in the fine-tuning setting. We can generate images unconditionally, conditioned on text (like DALL-E) and do captioning all in a zero-shot setting with a single model.

An Analysis of CM3: A Causally Masked Multimodal Model of the Internet

The paper presents CM3, a causally masked multimodal generative model designed to process and generate documents that contain both text and images. The model is trained on a large corpus of structured documents, such as web pages and Wikipedia articles, in which text and hypertext markup are kept in their original HTML order and images are encoded as discrete tokens by a VQVAE-GAN.

Causally Masked Approach

The innovative aspect of this model is its causally masked objective, a hybrid of the traditional causal and masked modeling objectives. Causal language models generate one token at a time, left to right, conditioned only on the preceding context, whereas masked language models can leverage bidirectional information to fill in blanks within a sequence. CM3 merges these approaches: a small number of long token spans are masked out of their original positions and generated at the end of the sequence, after the rest of the document has been processed. The model thus retains full left-to-right generative modeling while gaining bidirectional context for the masked spans, an attribute that is particularly beneficial in complex multimodal input-output scenarios.
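A minimal sketch of the document transformation this objective implies is shown below; the span-sampling policy, sentinel token names, and span lengths are illustrative assumptions, not the paper's exact implementation.

```python
import random

def causally_mask(tokens, n_spans=1, max_span_len=4):
    """Move a few contiguous spans to the end of the sequence so that they are
    generated last, with the rest of the (masked) document as context."""
    tokens = list(tokens)
    moved = []
    for i in range(n_spans):
        span_len = random.randint(1, max_span_len)
        if span_len >= len(tokens):
            break
        start = random.randrange(0, len(tokens) - span_len + 1)
        span = tokens[start:start + span_len]
        # Replace the span with a sentinel token at its original position.
        tokens[start:start + span_len] = [f"<mask:{i}>"]
        moved.append((i, span))
    # Append each masked span at the end, introduced by its sentinel, so the
    # model conditions on the full masked document before generating the span.
    for i, span in moved:
        tokens += [f"<mask:{i}>"] + span
    return tokens

sentence = "the quick brown fox jumps over the lazy dog".split()
print(causally_mask(sentence, n_spans=1))
# one possible output:
# ['the', 'quick', 'brown', 'fox', '<mask:0>', 'lazy', 'dog',
#  '<mask:0>', 'jumps', 'over', 'the']
```

Because the masked span is emitted only after the entire masked document, the tokens that originally followed it are already in the context window when it is generated, which is where the bidirectional benefit comes from.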

Large-Scale Multimodal Learning

Training was executed on nearly a terabyte of data featuring both text and image components. Images are distilled into sequences of discrete tokens by a VQVAE-GAN, so that varied image content can be represented in the same token stream as text. This diverges from traditional approaches, which often require carefully curated datasets with explicit text-image alignment: here, the model ingests documents in their native structure, with text, hypertext markup, hyperlinks, and image tokens kept in the order they appear in the HTML source.
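As a rough illustration of how such a document might be assembled before masking, the sketch below uses a hypothetical `vqgan_encode` stand-in for the VQVAE-GAN tokenizer; the HTML layout and token formatting are assumptions, not the paper's exact preprocessing pipeline.

```python
def vqgan_encode(image_path, n_tokens=256, codebook_size=8192):
    """Hypothetical stand-in for a VQVAE-GAN encoder, which maps an image to a
    grid of discrete codebook indices; here we just return placeholder ids."""
    return [hash((image_path, i)) % codebook_size for i in range(n_tokens)]


def build_document(title, body_text, image_path):
    """Serialize a page into a single token stream: text, HTML markup, and
    image tokens kept in the order they appear in the source."""
    image_tokens = " ".join(str(t) for t in vqgan_encode(image_path))
    return (
        f"<html><head><title>{title}</title></head><body>"
        f"<p>{body_text}</p>"
        f'<img src="{image_tokens}">'
        "</body></html>"
    )


doc = build_document(
    "Red panda",
    "The red panda is a small mammal native to the eastern Himalayas.",
    "red_panda.jpg",
)
print(doc[:120])
```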

Zero-Shot and Few-Shot Task Performance

CM3 demonstrates strong performance across numerous zero-shot and few-shot tasks. It sets new state-of-the-art results in zero-shot summarization, entity linking, and entity disambiguation, while remaining competitive when fine-tuned. With a single model and no task-specific training, it can generate images unconditionally or conditioned on text (as DALL-E does) and produce image captions, underscoring its adaptability.
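The prompting idea can be illustrated with hand-written templates: each zero-shot task corresponds to a different placement of the mask in an HTML-like document, and the model is asked to infill it. The sentinel names and markup below are assumptions, not the paper's exact prompt strings.

```python
# Text-to-image (DALL-E-style): condition on a caption, infill the image tokens.
image_generation = (
    "<figure><figcaption>A photo of a red panda in the snow</figcaption>"
    '<img src="<mask:0>"></figure>'
)

# Image captioning: condition on the image tokens, infill the caption text.
captioning = (
    '<figure><img src="412 17 3051 ..."> '
    "<figcaption><mask:0></figcaption></figure>"
)

# Entity linking / disambiguation (GENRE-style): infill the hyperlink target.
entity_linking = (
    '<p>She studied at <a href="<mask:0>">Cambridge</a> before moving on.</p>'
)
```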

Comparison with Prior Work

The framework offers a significant advance over previous multimodal models, showing superior zero-shot capabilities due to its structured document training approach, which inherently provides extensive real-world context. In comparison, models like DALL-E, while handling image generation proficiently, are typically limited to text-to-image tasks and do not possess CM3's broad task adaptability.

Implications and Future Directions

The practical implications of CM3 are significant, with applications ranging from content generation and summarization to more nuanced information retrieval and disambiguation. Its use of bidirectional context makes it a natural fit for intricate task settings that require comprehensive understanding across multimodal data.

Theoretically, the paper paves the way for further exploration into causally masked modeling techniques, potentially extending to other domains beyond text and image. Future developments might include expanding the training data to encompass other modalities, such as audio or video, utilizing the causally masked structure to maintain seamless representation and transformation between diverse information forms.

Overall, CM3 presents compelling evidence for the efficacy and versatility of causally masked multimodal generative models, significantly broadening the horizon for future AI developments in multimodal content understanding and generation.

Authors (11)
  1. Armen Aghajanyan (31 papers)
  2. Bernie Huang (4 papers)
  3. Candace Ross (25 papers)
  4. Vladimir Karpukhin (13 papers)
  5. Hu Xu (87 papers)
  6. Naman Goyal (37 papers)
  7. Dmytro Okhonko (11 papers)
  8. Mandar Joshi (24 papers)
  9. Gargi Ghosh (30 papers)
  10. Mike Lewis (78 papers)
  11. Luke Zettlemoyer (225 papers)
Citations (145)