JetFormer: An Autoregressive Generative Model of Raw Images and Text (2411.19722v1)

Published 29 Nov 2024 in cs.LG, cs.AI, and cs.CV

Abstract: Removing modeling constraints and unifying architectures across domains has been a key driver of the recent progress in training large multimodal models. However, most of these models still rely on many separately trained components such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder-only transformer - JetFormer - which is trained to directly maximize the likelihood of raw data, without relying on any separately pretrained components, and can understand and generate both text and images. Specifically, we leverage a normalizing flow model to obtain a soft-token image representation that is jointly trained with an autoregressive multimodal transformer. The normalizing flow model serves as both an image encoder for perception tasks and an image decoder for image generation tasks during inference. JetFormer achieves text-to-image generation quality competitive with recent VQ-VAE- and VAE-based baselines. These baselines rely on pretrained image autoencoders, which are trained with a complex mixture of losses, including perceptual ones. At the same time, JetFormer demonstrates robust image understanding capabilities. To the best of our knowledge, JetFormer is the first model that is capable of generating high-fidelity images and producing strong log-likelihood bounds.

Summary

  • The paper demonstrates that JetFormer unifies text and image generative tasks using a single autoregressive transformer architecture.
  • It introduces a normalizing flow mechanism to convert images into soft-token representations, enabling end-to-end training and effective likelihood maximization.
  • The model achieves competitive results on ImageNet and multimodal tasks, suggesting improved training efficiency and integration simplicity.

Overview of JetFormer: An Autoregressive Generative Model for Images and Text

The paper presents "JetFormer," a new model aimed at unified generative modeling of text and image data with an autoregressive decoder-only transformer. JetFormer addresses the structural inefficiencies often found in multimodal models by simplifying the architecture and removing dependencies on separately pretrained components. Instead, it integrates a normalizing flow model that produces soft-token image representations, allowing end-to-end training across both text and vision tasks.

Core Contributions and Methodology

  1. Unified Model Architecture: JetFormer employs a single autoregressive transformer to process discrete text tokens and continuous image representations in one sequence, without incorporating pretrained modality-specific encoders, which are common in many existing multimodal architectures (a toy sketch of this shared sequence follows this list).
  2. Image Representation via Normalizing Flow: The model uses a normalizing flow to convert image data into a sequence of soft tokens that the transformer can process. This component acts bidirectionally: as an image encoder for perception tasks and as an image decoder for generation at inference time. The change-of-variables identity sketched after this list shows how this invertible mapping enters the likelihood.
  3. Training Objective and Method: The primary training objective is to maximize the log-likelihood of the raw input data. As is typical of autoregressive models, JetFormer minimizes the negative log-likelihood over both textual and visual tokens, and it uses a Gaussian mixture model (GMM) loss to handle the continuous image feature space (a minimal sketch of such a loss follows this list).
  4. Improvement Techniques: To steer training toward global image coherence, JetFormer introduces a noise curriculum: Gaussian noise is added to the images during training and gradually annealed, guiding the model to prioritize high-level image structure early in training (see the schedule sketch after this list).
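
To make item 1 concrete, here is an illustrative sketch of how discrete text embeddings and continuous image soft tokens can share one causal, decoder-only transformer. It is a minimal PyTorch example under our own assumptions (module names, dimensions, and the mask construction are ours, not JetFormer's actual implementation):

```python
import torch
import torch.nn as nn

class UnifiedDecoder(nn.Module):
    """One causal transformer over the sequence [text tokens][image soft tokens]."""

    def __init__(self, vocab_size=32_000, d_model=512, soft_token_dim=128):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)    # discrete text tokens
        self.image_proj = nn.Linear(soft_token_dim, d_model)   # continuous soft tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Decoder-only behaviour comes from an encoder stack plus a causal mask.
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_ids, image_soft_tokens):
        # Map both modalities to the same width and concatenate into one sequence.
        seq = torch.cat(
            [self.text_embed(text_ids), self.image_proj(image_soft_tokens)], dim=1
        )
        n = seq.size(1)
        causal_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        return self.backbone(seq, mask=causal_mask)
```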
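
For item 2, the standard normalizing-flow change-of-variables identity makes explicit how an invertible image-to-soft-token mapping enters the likelihood that JetFormer maximizes (the notation below is ours):

$$\log p(x) \;=\; \log p_\theta(z) \;+\; \log\left|\det \frac{\partial f_\phi(x)}{\partial x}\right|, \qquad z = f_\phi(x), \qquad p_\theta(z) = \prod_{i=1}^{N} p_\theta(z_i \mid z_{<i}),$$

where $f_\phi$ is the jointly trained flow and each conditional $p_\theta(z_i \mid z_{<i})$ is the Gaussian mixture predicted by the transformer (item 3). Because $f_\phi$ is invertible, the same module decodes generated soft tokens back into pixels at inference time.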
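
A minimal sketch of the per-token Gaussian-mixture negative log-likelihood from item 3, assuming PyTorch and diagonal-covariance components (tensor names and shapes are illustrative, not the paper's):

```python
import math
import torch
import torch.nn.functional as F

def gmm_nll(logits, means, log_scales, targets):
    """Negative log-likelihood of continuous soft tokens under a diagonal Gaussian mixture.

    logits:     (B, T, K)     unnormalized mixture weights per token
    means:      (B, T, K, D)  component means
    log_scales: (B, T, K, D)  component log standard deviations
    targets:    (B, T, D)     soft tokens produced by the flow
    """
    x = targets.unsqueeze(2)                                   # (B, T, 1, D)
    # Per-component diagonal Gaussian log-density, summed over the feature dim D.
    comp_logp = -0.5 * (
        (x - means) ** 2 * torch.exp(-2.0 * log_scales)
        + 2.0 * log_scales
        + math.log(2.0 * math.pi)
    ).sum(-1)                                                  # (B, T, K)
    log_weights = F.log_softmax(logits, dim=-1)
    # Mixture log-density via log-sum-exp over the K components.
    logp = torch.logsumexp(log_weights + comp_logp, dim=-1)    # (B, T)
    return -logp.mean()
```

In an autoregressive setup like this, the mixture parameters for each image position would be read off the transformer's output at the preceding position, analogous to the softmax over the vocabulary used for text tokens.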
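
Finally, a toy version of the noise curriculum from item 4. The linear schedule here is our own simplification; the description above only states that the noise is gradually reduced over training:

```python
import torch

def noise_sigma(step, total_steps, max_sigma=1.0):
    """Anneal the noise standard deviation from max_sigma down to zero."""
    frac = min(step / total_steps, 1.0)
    return max_sigma * (1.0 - frac)

def add_curriculum_noise(images, step, total_steps, max_sigma=1.0):
    """Corrupt input images with Gaussian noise whose scale shrinks as training progresses."""
    sigma = noise_sigma(step, total_steps, max_sigma)
    return images + sigma * torch.randn_like(images)
```

Early in training the model therefore sees heavily corrupted images and can only fit coarse, global structure; as the noise fades, it progressively attends to fine detail.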

Experimental Results

JetFormer is evaluated on class-conditional image generation on ImageNet and on web-scale multimodal tasks. Its text-to-image generation quality is competitive with recent models built on VQ-VAE or VAE frameworks. One notable claim is JetFormer's high recall, attributed to its explicit likelihood modeling, which inherently avoids the mode collapse commonly seen in GAN-based architectures.

In particular, the model demonstrates strong performance in tasks involving vision-language understanding, highlighting its versatility across modalities without sacrificing performance in any single domain.

Implications and Future Directions

By achieving competitive quality without separate pretrained encoders, JetFormer sets a precedent for constructing streamlined generative models capable of handling multiple modalities. This architectural consolidation could lead to efficiency improvements in both computational cost and integration simplicity in diverse machine learning applications.

Future research may explore further optimization of the model's scalability and sample efficiency. Additionally, equipping the model with interpretability tools to scrutinize its decision-making across language and visual domains could offer more nuanced insight into the trade-offs involved.

Overall, JetFormer stands as a potent demonstration of the capability of unified autoregressive models in contextually rich multimodal environments. Its contribution lies not only in the generation quality but also in promoting a more integrated approach to neural architecture design for simultaneous text and image data processing.