- The paper proposes a single-transformer framework that performs both autoregressive text generation and diffusion-based image generation using distinct attention masks.
- Its methodology employs causal masks for text tasks and bidirectional masks for image diffusion, demonstrating competitive performance on benchmarks like ImageNet.
- Experimental results show strong image quality (FID 2.57 on ImageNet 256×256), comparable to leading diffusion models, alongside robust text generation despite the challenges of mixed training data.
The paper "MonoFormer: One Transformer for Both Diffusion and Autoregression" proposes a unified approach that uses a single transformer for both autoregressive text generation and diffusion-based image generation. This is achieved without separate architectures, diverging from the prevalent practice in multimodal generation models.
Core Idea and Motivation
Traditionally, multimodal systems either employ separate models for text (autoregressive) and image (diffusion) generation, or reuse a single autoregressive model by discretizing the visual data into tokens. The authors observe that transformer architectures have been applied successfully to both autoregression and diffusion individually, and that the primary distinction lies in the attention mask: causal for autoregression, bidirectional for diffusion. Building on this insight, they introduce the MonoFormer framework, in which both tasks share a single transformer, simplifying the architecture and the training process.
Methodology
MonoFormer hinges on the ability of a single transformer to serve both tasks through task-specific attention masking (a minimal mask sketch follows this list):
- Autoregressive Task: Utilizes a causal attention mask, ensuring each token only attends to preceding tokens.
- Diffusion Task: Employs a bidirectional attention mask, so every image token can attend to all other image tokens as well as to the conditioning text.
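The contrast between the two masks can be made concrete with a small sketch. The snippet below is illustrative rather than the authors' code: it builds one combined mask in which a text prefix is attended causally and the image-latent segment is attended bidirectionally; the segment layout and helper name are assumptions for exposition.

```python
import torch

def build_attention_mask(text_len: int, image_len: int) -> torch.Tensor:
    """Combined attention mask: causal over text tokens, bidirectional over
    image (diffusion) tokens. True means attention is allowed."""
    total = text_len + image_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Text tokens attend causally: each text token sees only itself and earlier text tokens.
    mask[:text_len, :text_len] = torch.tril(
        torch.ones(text_len, text_len, dtype=torch.bool)
    )
    # Image tokens attend to all text tokens (the conditioning prefix) ...
    mask[text_len:, :text_len] = True
    # ... and bidirectionally to every image token, including themselves.
    mask[text_len:, text_len:] = True
    return mask

# Example: 4 text tokens conditioning 3 image latent tokens.
print(build_attention_mask(4, 3).int())
```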
The transformer itself is a standard LLM architecture, initialized from TinyLlama-1.1B v1.0 and then jointly trained on both autoregression and diffusion tasks. The overall loss combines the autoregression loss for text-to-text generation with the diffusion loss for text-to-image generation.
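The paper describes combining the two objectives during joint training; a training step of this kind can be sketched as follows. The model interface, batch fields, and `lambda_diff` weight are illustrative assumptions rather than details reported in the paper, and the forward-noising formula is the standard DDPM one with cumulative schedule `alphas_cumprod`.

```python
import torch
import torch.nn.functional as F

def training_step(model, text_batch, image_batch, alphas_cumprod, lambda_diff=1.0):
    """Illustrative joint objective: next-token cross-entropy for text plus
    noise-prediction MSE for image latents. Interfaces are assumptions."""
    # Autoregressive loss (causal mask inside the model): predict token t+1 from tokens <= t.
    logits = model(text_batch["input_ids"], mask="causal")
    ar_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        text_batch["input_ids"][:, 1:].reshape(-1),
    )

    # Diffusion loss (bidirectional mask inside the model): corrupt VAE latents
    # at a random timestep with the standard forward process and predict the noise.
    latents = image_batch["latents"]
    noise = torch.randn_like(latents)
    t = torch.randint(0, len(alphas_cumprod), (latents.size(0),), device=latents.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1)
    noisy = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * noise
    pred_noise = model(noisy, cond=image_batch["caption_ids"], timestep=t, mask="bidirectional")
    diff_loss = F.mse_loss(pred_noise, noise)

    # The weighting between the two terms is an assumption here.
    return ar_loss + lambda_diff * diff_loss
```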
Implementation Details
For text-to-text generation, the model autoregressively predicts the next token. For text-to-image generation, it follows a standard diffusion process in which Gaussian noise is iteratively denoised. Notably, the model operates in the latent space of a pretrained variational autoencoder (VAE), which encodes images into latent representations and decodes the denoised latents back into images.
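A compact sampling sketch illustrates this pipeline. The `model`, `vae`, and `scheduler` objects below are generic stand-ins rather than a specific library's API: the scheduler supplies the timestep sequence and the reverse-diffusion update, and the VAE decodes the final latents to pixels.

```python
import torch

@torch.no_grad()
def generate_image(model, vae, scheduler, prompt_ids, latent_shape):
    """Sketch of text-to-image sampling: start from Gaussian noise in the VAE's
    latent space, denoise step by step with the transformer, then decode."""
    latents = torch.randn(latent_shape)  # pure noise in latent space

    for t in scheduler.timesteps:
        # The transformer predicts the noise, conditioned on the text prompt;
        # image tokens use the bidirectional mask.
        noise_pred = model(latents, cond=prompt_ids, timestep=t, mask="bidirectional")
        # One reverse-diffusion update (e.g., a DDPM/DDIM step); details depend on the scheduler.
        latents = scheduler.step(noise_pred, t, latents)

    return vae.decode(latents)  # map latents back to pixel space
```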
The training data combines the JourneyDB (text-to-image) and UltraChat (text-to-text) datasets, with image-generation samples drawn at a higher ratio to cope with the greater difficulty of that task.
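One simple way to realize such a mixture is probabilistic batch sampling, sketched below. The 0.9 ratio and the function names are assumptions; the paper is only summarized here as giving image-generation samples the larger share.

```python
import random

def next_training_batch(image_gen_loader, text_loader, p_image=0.9):
    """Mixed-task sampling sketch: draw an image-generation batch with
    probability p_image, otherwise a text batch. The 0.9 value is illustrative."""
    if random.random() < p_image:
        return next(image_gen_loader), "diffusion"
    return next(text_loader), "autoregression"
```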
Experimental Results
Image Generation:
- Evaluated on the ImageNet 256×256 benchmark.
- Metrics: FID, IS, Precision, and Recall.
- MonoFormer achieved an FID of 2.57, outperforming AR-based models such as LlamaGen-3B and performing comparably to DiT-XL/2, a state-of-the-art diffusion model.
Text Generation:
- Assessed on a variety of commonsense reasoning tasks, including HellaSwag, OpenBookQA, WinoGrande, ARC-C/E, BoolQ, and PIQA.
- MonoFormer demonstrated performance similar to the baseline TinyLlama model, albeit with slight drops potentially due to the mixed training dataset.
Ablations
Transformers initialized with pretrained LLM weights significantly outperformed non-pretrained counterparts on both image and text generation. In addition, bidirectional attention masks yielded better diffusion performance than causal masks, underscoring their importance for image generation.
Implications and Future Directions
MonoFormer presents a promising step toward unified multimodal generation models, potentially simplifying architecture design and reducing computational overhead. The implications extend to multimodal understanding: future work could adapt MonoFormer to vision-language understanding tasks by processing images autoregressively for comprehension while retaining diffusion for generation. Additionally, enriching the language portion of the training data might mitigate the slight drops observed in text generation performance.
Given the comparable performance with state-of-the-art models in both image and text domains, MonoFormer positions itself as a feasible option for multimodal generation, promising a blend of architectural simplicity and robust performance.
Conclusion
The MonoFormer framework brings forward a simple yet effective idea: leveraging a single transformer for both autoregression and diffusion tasks. The results indicate a successful integration of text-to-text and text-to-image generation within one model, with quality that matches or approaches strong task-specific baselines. Future research may extend its applicability to broader multimodal tasks, potentially leading to more cohesive and efficient models.