
UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding

Published 8 Feb 2025 in cs.CV and cs.AI (arXiv:2502.05415v2)

Abstract: Consistency models (CMs) have shown promise in the efficient generation of both image and text. This raises the natural question of whether we can learn a unified CM for efficient multimodal generation (e.g., text-to-image) and understanding (e.g., image-to-text). Intuitively, such a model could be acquired by applying the consistency distillation (CD) to existing unified multimodal models. However, the key challenge is establishing a unified denoising perspective for both image and text generation, which is essential for establishing the consistency mapping. To tackle this, at the representation level, we advocate for discrete tokens for both modalities to best preserve language modeling capabilities. Critically, instead of defining the text denoising trajectory via recent discrete diffusion language modeling principles, we specify it using the parallel decoding trace of an autoregressive LLM, benefiting from the latter's superior performance in general text generation tasks. The denoising trajectory of image tokens adheres to standard discrete diffusion. We train our unified consistency models (UniCMs) on these combined multimodal trajectories simultaneously with a unified objective. We introduce a trajectory segmentation strategy to further improve the training convergence. Empirically, in text-to-image generation, UniCMs outperform SD3 on GenEval, Image Reward, and CLIP Score metrics, while requiring only approximately ${1}/{8}$ of the sampling time. Meanwhile, in image-to-text generation, UniCMs surpass Show-o on the MMMU benchmark while being $1.5 \times$ faster at long-sequence generating speed. The code is available at https://github.com/zhijie-group/UniCMs.

Summary

  • The paper's main contribution is introducing a unified denoising approach that integrates text and image generation using consistency distillation.
  • It employs parallel decoding and trajectory segmentation to markedly reduce sampling steps while achieving improved GenEval, HPS, IR, and CLIP scores.
  • Results demonstrate a 1.5x speedup in image-to-text tasks and robust multimodal performance, highlighting practical efficiency gains.


Introduction

The paper introduces "Show-o Turbo" (the model the paper's title calls UniCMs), an enhanced version of Show-o, a multimodal generative model capable of both text-to-image (T2I) and image-to-text (I2T) tasks. Show-o combines discrete diffusion for images with autoregressive decoding for text, but suffers from inefficiency due to the lengthy sampling processes required for both modalities. Show-o Turbo addresses this by unifying the generation process under a single denoising perspective, leveraging parallel decoding algorithms for text and consistency distillation (CD) principles inspired by diffusion model acceleration.

Methodology

Unified Denoising Perspective

The authors propose a unified denoising view for both image and text generation by utilizing parallel text decoding algorithms, such as Jacobi decoding. This approach iteratively refines text tokens in parallel, akin to the denoising pattern observed in image generation (Figure 1).

Figure 1: Illustration of the sampling trajectories of text and image tokens in Show-o. As shown, they both display a denoising pattern.
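The fixed-point iteration behind Jacobi decoding can be sketched in a few lines. Here `next_token_fn` is a hypothetical greedy next-token predictor standing in for the LLM, and the zero-initialized guess is an illustrative choice, not the paper's actual interface:

```python
def jacobi_decode(next_token_fn, prompt, n_new, max_iters=32):
    """Refine a block of n_new tokens in parallel until a fixed point.

    next_token_fn(seq) must return, for each position i, the greedy
    prediction of the token that follows seq[i].
    """
    guess = [0] * n_new  # arbitrary initial guess for the new tokens
    for _ in range(max_iters):
        seq = prompt + guess
        preds = next_token_fn(seq)
        # Every new position is re-predicted simultaneously from the
        # current guess of all preceding tokens.
        new_guess = preds[len(prompt) - 1 : len(seq) - 1]
        if new_guess == guess:  # fixed point: matches sequential decoding
            return guess
        guess = new_guess
    return guess
```

Each iteration updates the whole block at once, and the sequence of intermediate guesses forms the "denoising trajectory" for text that the paper aligns with the image-token trajectory.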

Consistency Distillation

Show-o Turbo applies consistency distillation by mapping any point on the sampling trajectory to the same endpoint. This adaptation of CD to multimodal trajectories, combined with trajectory segmentation and curriculum learning strategies, aids in improving training convergence, enabling efficient sampling with fewer steps.
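The core idea of consistency distillation, making predictions from different trajectory points agree on the same endpoint, can be sketched as a divergence between two predicted token distributions. In a real training loop the target would come from a later (cleaner) trajectory state and be detached from the gradient; the function names below are illustrative, not the paper's:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(pred_from_noisy, pred_from_cleaner):
    """Pull the prediction made from an earlier (noisier) trajectory
    point toward the prediction made from a later point, so that both
    map to the same endpoint. pred_from_cleaner acts as the target
    (it would be stop-gradient/detached in an actual framework)."""
    return kl(pred_from_cleaner, pred_from_noisy)
```

The loss vanishes when the two points already agree, which is exactly the consistency property the distilled model is trained to satisfy along the whole trajectory.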

Implementation Details

To ensure convergence, a combination of global consistency loss and regularizations related to text generation and prediction distributions is used. For efficient training, trajectory segmentation divides the sampling process into segments with decreasing lengths, while curriculum learning adapts the learning objectives as the student model deviates from the teacher's trajectories.
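One way to realize trajectory segmentation is to split the sampling trajectory into contiguous segments of decreasing length and enforce consistency only within each segment, which shortens the mapping the student must learn early in training. The halving schedule below is an illustrative assumption, not the paper's exact schedule:

```python
def segment_boundaries(n_steps, num_segments):
    """Split an n_steps-long trajectory into num_segments contiguous
    segments with non-increasing lengths (each roughly half of what
    remains), returning the boundary indices."""
    lengths, remaining = [], n_steps
    for _ in range(num_segments - 1):
        cut = max(1, remaining - remaining // 2)  # take the larger half
        lengths.append(cut)
        remaining -= cut
    lengths.append(remaining)
    # Convert segment lengths to cumulative boundary indices.
    bounds, pos = [0], 0
    for length in lengths:
        pos += length
        bounds.append(pos)
    return bounds
```

For example, a 16-step trajectory with 3 segments would be cut at steps 0, 8, 12, and 16; consistency targets are then the segment endpoints rather than the final clean sample.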

Results

Text-to-Image and Image-to-Text Generation

Empirically, Show-o Turbo achieves significant improvements in generation speed without notable performance degradation. For T2I generation, the model delivers superior GenEval, Human Preference Score (HPS), ImageReward (IR), and CLIP Score (CS) metrics even with fewer sampling steps, compared to Show-o with classifier-free guidance (CFG) (Figure 2).

Figure 2: Comparison between Show-o and Show-o Turbo at 512 resolution in T2I generation. The former crashes in two-step sampling, while the latter maintains good performance.

In image-to-text tasks, Show-o Turbo runs roughly 1.5x faster while maintaining competitive accuracy on captioning benchmarks such as Flickr30K and NoCaps, and robust performance across multiple-choice MMU tasks.

Multimodal Understanding

For MMU tasks requiring one-token responses, Show-o Turbo retains strong question-answering capabilities, reinforcing the model's broad applicability across diverse multimodal scenarios.

Limitations and Future Work

The primary trade-off observed is between acceleration and performance in description tasks. The authors suggest using more sophisticated MMU corpora for distillation as a potential optimization approach. Additionally, extending the consistency distillation principle to more complex or hierarchical sampling trajectories could further enhance efficiency without sacrificing accuracy.

Conclusion

Show-o Turbo demonstrates effective acceleration strategies for multimodal generative models through unified consistency modeling and parallel decoding techniques. By addressing inefficiencies endemic to discrete generative models through robust distillation processes and innovative training methodologies, this adaptation sets a new standard for multimodal generation without relying on additional guidance mechanisms such as CFG. This contribution opens pathways for further exploration of unified models capable of efficiently handling diverse generative and understanding tasks across modalities.
