MonoFormer: One Transformer for Both Diffusion and Autoregression (2409.16280v1)

Published 24 Sep 2024 in cs.CV

Abstract: Most existing multimodality methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or the same backbone by discretizing the visual data to use autoregression for both text and visual generation. In this paper, we propose to study a simple idea: share one transformer for both autoregression and diffusion. The feasibility comes from two main aspects: (i) Transformer is successfully applied to diffusion for visual generation, and (ii) transformer training for autoregression and diffusion is very similar, and the difference merely lies in that diffusion uses bidirectional attention mask and autoregression uses causal attention mask. Experimental results show that our approach achieves comparable image generation performance to current state-of-the-art methods as well as maintains the text generation capability. The project is publicly available at https://monoformer.github.io/.

Citations (5)

Summary

  • The paper proposes a single-transformer framework that performs both autoregressive text generation and diffusion-based image generation using distinct attention masks.
  • Its methodology employs causal masks for text tasks and bidirectional masks for image diffusion, demonstrating competitive performance on benchmarks like ImageNet.
  • Experimental results highlight state-of-the-art image quality (FID 2.57) and robust text generation performance, despite challenges from mixed training data.

MonoFormer: One Transformer for Both Diffusion and Autoregression

The paper "MonoFormer: One Transformer for Both Diffusion and Autoregression" proposes a consolidated approach that leverages a single transformer model for both autoregressive text generation and diffusion-based image generation. This is achieved without separate architectures, diverging from the prevalent practice in multimodal generation models.

Core Idea and Motivation

Traditionally, either separate models, or a single model that discretizes visual data, have been employed for text (autoregressive) and image (diffusion) generation tasks. The authors observe that transformer architectures have been successfully applied to autoregression and diffusion individually, and that the primary distinction lies in the attention masks used (causal for autoregression, bidirectional for diffusion). Building on this insight, they introduce the MonoFormer framework, which shares a single transformer across both tasks, thus simplifying the architecture and training process.

Methodology

MonoFormer hinges on the ability of the transformer to handle both modalities through strategic attention masking (a short mask-construction sketch follows the list):

  • Autoregressive Task: Utilizes a causal attention mask, so each token attends only to preceding tokens.
  • Diffusion Task: Employs a bidirectional attention mask, so every token attends to every other token in the sequence.
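
Below is a minimal sketch of the two mask patterns, assuming PyTorch-style additive attention masks where 0 means "attend" and -inf means "block"; the combined mask for a mixed text-image sequence is an illustration of the idea rather than the paper's exact layout.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Autoregressive (text) mode: token i may attend only to tokens <= i.
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)  # upper triangle blocked, rest 0

def bidirectional_mask(seq_len: int) -> torch.Tensor:
    # Diffusion (image) mode: every token may attend to every other token.
    return torch.zeros(seq_len, seq_len)

def mixed_mask(text_len: int, image_len: int) -> torch.Tensor:
    # Illustrative combination: a causal text prefix followed by image-latent
    # tokens that attend bidirectionally among themselves (and to the prefix).
    total = text_len + image_len
    mask = torch.triu(torch.full((total, total), float("-inf")), diagonal=1)
    mask[text_len:, text_len:] = 0.0  # lift the causal constraint for image tokens
    return mask
```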

The transformer architecture used is a standard LLM transformer, initialized with TinyLlama-1.1B v1.0, and subsequently optimized through tasks involving both autoregression and diffusion. The overall loss function combines the autoregression loss for text-to-text generation with the diffusion loss for text-to-image generation.
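
As a rough sketch of how the two objectives might be combined in a single training step, one can assume a standard next-token cross-entropy loss for text and a noise-prediction MSE loss for image latents; the weighting factor `lam` below is illustrative and not taken from the paper.

```python
import torch.nn.functional as F

def combined_loss(text_logits, text_targets, predicted_noise, true_noise, lam=1.0):
    # Autoregression loss: next-token cross-entropy over the text tokens.
    ar_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    # Diffusion loss: MSE between the predicted and the actually added noise.
    diffusion_loss = F.mse_loss(predicted_noise, true_noise)
    return ar_loss + lam * diffusion_loss
```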

Implementation Details

For text-to-text generation, the model autoregressively predicts the next token. For text-to-image generation, it follows a standard diffusion process in which Gaussian noise is iteratively denoised to produce an image. Notably, the model leverages a pretrained variational autoencoder (VAE) to encode images into latent representations, which are decoded back into pixel space after generation.
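
The following sketch of a single text-to-image training step is based only on the ingredients stated above: a frozen pretrained VAE produces latents, Gaussian noise is added at a random timestep, and the shared transformer is trained to predict that noise. The `transformer`, `vae`, and `scheduler` interfaces are hypothetical placeholders, not the released API.

```python
import torch
import torch.nn.functional as F

def text_to_image_step(transformer, vae, scheduler, images, text_tokens):
    with torch.no_grad():
        # Frozen pretrained VAE: images -> continuous latent representations.
        latents = vae.encode(images)

    # Forward diffusion: corrupt the latents with Gaussian noise at random timesteps.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.num_timesteps, (latents.size(0),), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # The shared transformer conditions on the text prefix (causal attention)
    # and denoises the latent tokens (bidirectional attention).
    predicted_noise = transformer(text_tokens, noisy_latents, t)
    return F.mse_loss(predicted_noise, noise)
```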

The training data combined the JourneyDB and UltraChat datasets, maintaining a higher proportion of image-generation samples to manage the greater difficulty of that task.
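
One simple way such a mixed-data schedule could be realized is a probabilistic sampler with a configurable image-to-text ratio; the ratio below is a placeholder, not the value used in the paper.

```python
import random

def sample_training_example(image_data, text_data, image_prob=0.75):
    # Draw a text-to-image sample with probability image_prob (placeholder value),
    # otherwise a text-to-text sample.
    if random.random() < image_prob:
        return "image", random.choice(image_data)
    return "text", random.choice(text_data)
```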

Experimental Results

Image Generation:

  • Evaluated on ImageNet 256x256 benchmark.
  • Metrics: FID, IS, Precision, and Recall.
  • MonoFormer achieved an FID of 2.57, outperforming AR-based models such as LlamaGen-3B and performing comparably to DiT-XL/2, a state-of-the-art diffusion model.

Text Generation:

  • Assessed on a variety of commonsense reasoning tasks, including HellaSwag, OpenBookQA, WinoGrande, ARC-C/E, BoolQ, and PIQA.
  • MonoFormer demonstrated performance similar to the baseline TinyLlama model, albeit with slight drops potentially due to the mixed training dataset.

Ablations

Transformers initialized with pretrained LLMs significantly outperformed non-pretrained counterparts in both image and text generation tasks. Additionally, bidirectional attention masks for diffusion yielded better performance than causal masks, underscoring their necessity for image generation.

Implications and Future Directions

MonoFormer presents a promising step towards unified multimodal generation models, potentially simplifying architecture design and reducing computational overhead. The implications extend to multimodal understanding: future research could apply MonoFormer to vision-language understanding tasks by processing images autoregressively for comprehension and via diffusion for generation. Additionally, enriching the language component of the training data might mitigate the slight drops observed in text generation performance.

Given the comparable performance with state-of-the-art models in both image and text domains, MonoFormer positions itself as a feasible option for multimodal generation, promising a blend of architectural simplicity and robust performance.

Conclusion

The MonoFormer framework brings forth an innovative yet straightforward concept: leveraging a single transformer for both autoregression and diffusion tasks. The results indicate successful integration of text-to-text and text-to-image tasks within a single model, matching or exceeding current benchmarks. Future research may further enhance its applicability across broader multimodal tasks, potentially leading to more cohesive and efficient AI models.

