Integrating LLMs and Diffusion Models for Video Generation
The paper "The Best of Both Worlds: Integrating LLMs and Diffusion Models for Video Generation" introduces a novel video generation framework, LanDiff, which unites the capabilities of autoregressive LLMs and diffusion models. This hybrid framework addresses the inherent limitations of each paradigm by leveraging their complementary strengths through a coarse-to-fine generative approach.
Framework Overview
The primary innovations of LanDiff are a semantic tokenizer, an LLM-based semantic token generator, and a diffusion-based refinement stage.
- Semantic Tokenizer: LanDiff introduces a semantic tokenizer that compresses 3D visual features into a compact 1D sequence of discrete tokens, reaching a compression ratio of roughly 14,000:1. By retaining only high-level semantic information at a very small bit cost, it represents a video with a drastically shorter token sequence than existing tokenizers such as MAGVIT-v2 (a minimal sketch of this idea follows the list below).
- Autoregressive Semantic Token Generation: The LLM in LanDiff predicts semantic tokens that capture high-level semantic relationships, in contrast to prior work in which the autoregressive model directly generates low-level perceptual tokens. In addition, inspired by MP4-style video coding, the model distinguishes keyframes (I-frames) from non-keyframes (P-frames) and allocates far fewer tokens to the latter, reducing temporal redundancy (see the second sketch below).
- Diffusion Model for Refinement: A streaming diffusion model refines the generated semantic tokens into high-fidelity video, adding the perceptual detail that LLM-only generation typically lacks (see the third sketch below).
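To make the tokenizer idea concrete, here is a minimal sketch of a query-based 1D tokenizer with vector quantization. The module sizes, the cross-attention pooling, and the nearest-neighbour codebook lookup are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of a query-based 1D semantic tokenizer with vector quantization.
# All sizes and the nearest-neighbour quantizer are illustrative assumptions.
import torch
import torch.nn as nn

class SemanticTokenizer(nn.Module):
    def __init__(self, feat_dim=1024, num_queries=32, codebook_size=8192):
        super().__init__()
        # Learnable 1D queries that pool the flattened 3D (T x H x W) feature map.
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.codebook = nn.Embedding(codebook_size, feat_dim)

    def forward(self, video_feats):
        # video_feats: (B, T*H*W, feat_dim) features from a frozen video encoder.
        B = video_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        pooled, _ = self.attn(q, video_feats, video_feats)  # (B, num_queries, feat_dim)
        # Vector quantization: snap each pooled query to its nearest codebook entry.
        dists = torch.cdist(pooled, self.codebook.weight.unsqueeze(0).expand(B, -1, -1))
        token_ids = dists.argmin(dim=-1)                     # (B, num_queries) discrete tokens
        return token_ids

# Collapsing a long spatio-temporal feature grid into a few dozen discrete tokens
# per segment is what makes compression ratios on the order of the paper's
# ~14,000:1 figure possible.
```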
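The keyframe/non-keyframe split can be pictured as an uneven token budget across frames. The group size and per-frame token counts below are made-up placeholders used only to show the bookkeeping.

```python
# Illustrative token-budget scheme for keyframes vs. non-keyframes. The exact
# group size and per-frame token counts are placeholders; the point is simply
# that I-frames carry many more tokens than P-frames.
def frame_token_budget(num_frames, group_size=13, iframe_tokens=32, pframe_tokens=4):
    budget = []
    for i in range(num_frames):
        if i % group_size == 0:
            budget.append(("I", iframe_tokens))  # keyframe: full semantic description
        else:
            budget.append(("P", pframe_tokens))  # non-keyframe: only the change vs. context
    return budget

# e.g. frame_token_budget(26) -> 2 I-frames + 24 P-frames,
# 2*32 + 24*4 = 160 tokens instead of 26*32 = 832 with uniform allocation.
```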
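Finally, a hedged sketch of what chunk-wise ("streaming") refinement could look like: each chunk of semantic tokens conditions a short denoising loop, and the previous chunk's latent is carried forward for temporal continuity. The `denoiser` and `decode_latents` callables, the bare reverse loop, and the previous-chunk conditioning are all assumptions, not the paper's implementation.

```python
# Sketch of chunk-wise ("streaming") diffusion refinement conditioned on
# semantic tokens. `denoiser` and `decode_latents` stand in for the diffusion
# backbone and latent decoder.
import torch

@torch.no_grad()
def refine_video(semantic_tokens, denoiser, decode_latents,
                 num_chunks=4, steps=50, latent_shape=(16, 4, 32, 32)):
    chunks = semantic_tokens.chunk(num_chunks, dim=1)  # split tokens along time
    prev_latent, video = None, []
    for cond in chunks:
        x = torch.randn(1, *latent_shape)              # start each chunk from noise
        for t in reversed(range(steps)):
            # Each denoising step sees the chunk's semantic tokens and, after the
            # first chunk, the previously generated latent for temporal continuity.
            x = denoiser(x, t, cond=cond, prev=prev_latent)
        prev_latent = x
        video.append(decode_latents(x))
    return torch.cat(video, dim=0)                     # concatenate chunks in time
```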
Experimental Results
LanDiff achieves a score of 85.43 on the VBench T2V benchmark, surpassing prominent models such as the 13B Hunyuan Video. It also performs strongly on long video generation, a task that remains difficult for current models. The results show gains in both the semantic and quality scores as well as in spatial and temporal coherence, indicating that LanDiff strikes an effective balance between semantic fidelity and visual detail.
Implications and Future Directions
The integration of LLMs with diffusion processes proposed in LanDiff could reshape how generative models approach the synthesis of high-dimensional structured data such as video. The paradigm bridges semantic understanding and visual realism while improving scalability, with potential applications in animation, virtual reality, and other domains.
Looking forward, extensions of LanDiff's architecture could explore training on larger-scale datasets, finer-grained alignment between text descriptions and video content, and further optimization of how the LLM and diffusion components are coupled. Domain-specific adaptations of the framework could also yield insight into tailored video generation applications.
In conclusion, by integrating LLMs' semantic generation capabilities with the iterative refinement of diffusion models, LanDiff represents a significant step forward in the synthesis of coherent, high-fidelity video from textual descriptions, paving the way for future advancements in AI-driven video generation technology.