Unified Multimodal Discrete Diffusion (2503.20853v1)

Published 26 Mar 2025 in cs.CV, cs.AI, cs.LG, and cs.RO

Abstract: Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches, which process tokens sequentially from left to right, or top to bottom. These models jointly handle images, text, video, and audio for various tasks such as image captioning, question answering, and image generation. In this work, we explore discrete diffusion models as a unified generative formulation in the joint text and image domain, building upon their recent success in text generation. Discrete diffusion models offer several advantages over AR models, including improved control over quality versus diversity of generated samples, the ability to perform joint multimodal inpainting (across both text and image domains), and greater controllability in generation through guidance. Leveraging these benefits, we present the first Unified Multimodal Discrete Diffusion (UniDisc) model which is capable of jointly understanding and generating text and images for a variety of downstream tasks. We compare UniDisc to multimodal AR models, performing a scaling analysis and demonstrating that UniDisc outperforms them in terms of both performance and inference-time compute, enhanced controllability, editability, inpainting, and flexible trade-off between inference time and generation quality. Code and additional visualizations are available at https://unidisc.github.io.

Unified Multimodal Discrete Diffusion

The research paper titled "Unified Multimodal Discrete Diffusion" introduces an innovative approach in the domain of multimodal generative models, aiming to enhance their efficacy in tasks involving text-image generation and understanding. The paper takes a decisive step away from the conventional autoregressive (AR) techniques, which follow a sequential token processing scheme, towards discrete diffusion models that offer a more unified approach to handling multimodal data.

Overview

Multimodal generative models have traditionally been dominated by autoregressive methods that encode and decode sequences of data tokens sequentially, adhering to fixed orderings such as left-to-right for text and raster order for images. While these methods have successfully catered to a wide array of tasks like image captioning, question answering, and image generation, they suffer from inherent inefficiencies related to inference speed and controllability. This paper leverages discrete diffusion models, recognizing their recent success in tackling text generation efficiently, to propose the Unified Multimodal Discrete Diffusion (UniDisc) model.

Diffusion models, typically continuous, employ processes that corrupt data with Gaussian noise and learn to denoise it. Discrete diffusion instead uses categorical transition distributions, which align better with the discrete nature of both text and image tokens. The framework proposed here operates on a unified vocabulary of image and text tokens: tokens are randomly masked during training, and the model learns to predict the masked tokens given the visible ones. This inherently non-sequential formulation enables joint text and image inpainting, a task that is difficult for traditional AR models unless they are explicitly trained for it.
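As a rough illustration of this masked (absorbing-state) training objective, the sketch below assumes a PyTorch-style bidirectional denoiser over a unified image-and-text vocabulary. The names `MASK_ID`, `VOCAB_SIZE`, and the model interface are hypothetical placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0          # assumed index of the special [MASK] token in the unified vocabulary
VOCAB_SIZE = 16384   # assumed size of the joint image+text token vocabulary

def diffusion_training_step(model, tokens):
    """tokens: (batch, seq_len) LongTensor of concatenated image and text token ids."""
    batch, seq_len = tokens.shape

    # Sample a corruption level t ~ U(0, 1) per sequence and independently mask
    # each token with probability t (the discrete "forward" process).
    t = torch.rand(batch, 1, device=tokens.device)
    mask = torch.rand(batch, seq_len, device=tokens.device) < t
    noisy_tokens = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

    # A bidirectional denoiser predicts the original token id at every position.
    logits = model(noisy_tokens)  # (batch, seq_len, VOCAB_SIZE)

    # Cross-entropy is computed only at masked positions, as in masked
    # (absorbing-state) discrete diffusion objectives.
    loss = F.cross_entropy(logits[mask], tokens[mask])
    return loss
```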

Key Contributions

  1. Unified Training Architecture: The UniDisc model employs bidirectional transformer architectures with specialized position embeddings for image and text tokens, integrating modality-specific embeddings for enhanced representation capabilities. This design choice allows simultaneous decoding of text and image tokens, overcoming the inefficiencies of AR models.
  2. Classifier-Free Guidance: By incorporating classifier-free guidance, a technique adapted from continuous diffusion models, UniDisc gains stronger controllability during generation. Guidance lets the model trade off generation quality against diversity, and UniDisc outperforms AR models on conditional generation tasks (a guidance sketch follows this list).
  3. Inference Efficiency: The paper reports clear gains in inference efficiency: tokens are decoded in parallel, so far fewer sampling steps are needed than with token-by-token AR decoding, reducing computational overhead (a simplified sampler appears after this list). A modality-specific caching mechanism further improves speed by reusing cached image-token states and avoiding redundant computation.
  4. Discriminative Capabilities: Beyond generation tasks, UniDisc demonstrates superior performance in discriminative tasks such as multimodal retrieval and visual reasoning, suggesting robust underlying representation capabilities.
  5. Scaling and Zero-Shot Capabilities: UniDisc scales effectively to larger models and datasets, maintaining high performance on zero-shot generation and other downstream tasks as scale increases.
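As a concrete illustration of contribution 2, here is a minimal sketch of classifier-free guidance applied to discrete diffusion logits. The function names, the null-prompt convention, and the choice to interpolate raw logits (a common approximation to combining log-probabilities) are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch

def guided_logits(model, noisy_tokens, cond_tokens, null_cond_tokens, w=1.5):
    """Combine conditional and unconditional predictions with guidance scale w."""
    prompt_len = cond_tokens.shape[1]

    # Conditional pass: the prompt tokens are visible alongside the masked targets.
    cond = model(torch.cat([cond_tokens, noisy_tokens], dim=1))[:, prompt_len:]
    # Unconditional pass: the prompt is replaced by a null (fully masked) prompt.
    uncond = model(torch.cat([null_cond_tokens, noisy_tokens], dim=1))[:, prompt_len:]

    # Sharpen the conditional distribution:
    #   log p_guided ∝ (1 + w) * log p_cond - w * log p_uncond
    return (1 + w) * cond - w * uncond
```

Larger values of w push samples toward the conditioning signal at the cost of diversity, which is the quality-versus-diversity trade-off described above.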
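Contribution 3 rests on parallel unmasking at inference time. The following sketch follows a common confidence-based sampler with a cosine masking schedule; the schedule, the selection rule, and the omission of caching are simplifying assumptions rather than UniDisc's exact procedure.

```python
import math
import torch

MASK_ID = 0  # assumed [MASK] token id

@torch.no_grad()
def parallel_sample(model, seq_len, num_steps=16, device="cpu"):
    # Start from a fully masked sequence and fill it in over num_steps passes.
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long, device=device)
    for step in range(num_steps):
        logits = model(tokens)                               # (1, seq_len, vocab)
        confidence, candidates = logits.softmax(-1).max(-1)  # best token and its probability

        still_masked = tokens == MASK_ID
        # Cosine schedule: the fraction of tokens left masked shrinks each step.
        frac_masked = math.cos((step + 1) / num_steps * math.pi / 2)
        target_unmasked = seq_len if step == num_steps - 1 else int((1 - frac_masked) * seq_len)
        num_to_unmask = target_unmasked - int((~still_masked).sum())
        if num_to_unmask <= 0:
            continue

        # Commit, in parallel, the most confident predictions among masked positions.
        confidence = confidence.masked_fill(~still_masked, -1.0)
        idx = confidence.topk(num_to_unmask, dim=-1).indices
        tokens.scatter_(1, idx, candidates.gather(1, idx))
    return tokens
```

With, say, 16 steps for a 1024-token sequence, the denoiser is evaluated 16 times instead of 1024, which is the source of the inference-time savings over AR decoding.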

Implications and Future Scope

The emergence of discrete diffusion models, particularly as demonstrated by UniDisc, presages a shift in how multimodal generative tasks are approached. The inherent flexibility in processing modalities jointly and improving control during generation mark significant advancements in the field. Notably, the ability of UniDisc to execute joint text and image inpainting tasks without explicit training indicates a promising direction for future AI systems that require dynamic adaptability across various data modalities.

Theoretical implications suggest clearer pathways for integrating diverse modality data sources into AI systems, supporting more sophisticated understanding and contextual reasoning capabilities. Practically, the reduced computational load and faster inference offer tangible benefits for scaling deployments and integrating these models into real-time systems.

Nonetheless, the scalability of discrete diffusion models remains a relevant avenue for continued research, particularly regarding their training efficiency compared to AR counterparts. The integration with large-scale datasets and the refinement of the intrinsic sampling strategies could further enhance the applicability and effectiveness of these models in broader AI domains.

In summary, the paper lays a robust foundation for discrete diffusion models as unified multimodal generative solutions. It emphasizes the advantages over traditional AR methods and sets the stage for future research that may expand the utility of these models in other challenging AI tasks.

Authors (5)
  1. Alexander Swerdlow (4 papers)
  2. Mihir Prabhudesai (12 papers)
  3. Siddharth Gandhi (41 papers)
  4. Deepak Pathak (91 papers)
  5. Katerina Fragkiadaki (61 papers)