xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
The paper "xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations" presents a text-to-video (T2V) generation model that leverages compressed representations to create realistic video sequences from textual descriptions. This model, named xGen-VideoSyn-1, introduces a novel combination of video variational autoencoding (VidVAE) and diffusion transformer (DiT) architectures, addressing several critical challenges in the domain of T2V synthesis, such as computational efficiency and video quality.
Key Contributions
- Video-Specific VAE: The paper proposes a VidVAE that compresses videos both spatially and temporally, significantly reducing the computational cost of video generation. Unlike models that compress each frame independently, VidVAE also compresses along the time axis, yielding a more compact encoding of video sequences and enabling the generation of long sequences without prohibitive computational expense (see the encoder sketch after this list).
- Divide-and-Merge Strategy: To avoid out-of-memory (OOM) issues when encoding long videos, the authors introduce a divide-and-merge strategy. The video is split into overlapping chunks that are processed individually and then merged, which keeps memory usage bounded while maintaining temporal consistency across chunk boundaries (see the merging sketch after this list).
- Diffusion Transformer Architecture: xGen-VideoSyn-1 employs a video diffusion transformer (VDiT) that extends latent diffusion models to video. The VDiT uses spatial and temporal self-attention layers to capture video dynamics and generalizes across different resolutions and aspect ratios. It combines rotary positional embeddings (RoPE) with sinusoidal encodings to represent spatial and temporal positions (see the attention sketch after this list).
- Extensive Dataset and Data Processing Pipeline: The paper details an automated, scalable data processing pipeline used to collect over 13 million high-quality video-text pairs. The pipeline includes deduplication, optical character recognition (OCR), motion detection, aesthetic scoring, and dense captioning (a filtering sketch follows this list). The resulting dataset is central to training the VidVAE and DiT models and to their competitive performance.
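
To make the VidVAE idea concrete, here is a minimal sketch of a 3D-convolutional video VAE encoder that downsamples both space and time. The 4x temporal and 8x8 spatial factors, channel widths, and the `ToyVidVAEEncoder` name are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a video VAE encoder with joint spatial-temporal compression.
# Downsampling factors (4x time, 8x8 space) and channel widths are illustrative.
import torch
import torch.nn as nn

class ToyVidVAEEncoder(nn.Module):
    """Compresses a (B, C, T, H, W) video into a much smaller latent tensor."""
    def __init__(self, in_channels=3, latent_channels=4, base_width=64):
        super().__init__()
        self.net = nn.Sequential(
            # Stride (1, 2, 2): spatial-only downsampling, keeps all frames.
            nn.Conv3d(in_channels, base_width, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            # Stride (2, 2, 2): joint temporal + spatial downsampling.
            nn.Conv3d(base_width, base_width * 2, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(base_width * 2, base_width * 4, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            # Project to mean and log-variance of the latent distribution.
            nn.Conv3d(base_width * 4, 2 * latent_channels, kernel_size=3, padding=1),
        )

    def forward(self, video):
        mean, logvar = self.net(video).chunk(2, dim=1)
        # Reparameterization trick: sample a latent from N(mean, exp(logvar)).
        return mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)

if __name__ == "__main__":
    clip = torch.randn(1, 3, 16, 256, 256)   # 16 frames at 256x256
    latent = ToyVidVAEEncoder()(clip)
    print(latent.shape)                       # torch.Size([1, 4, 4, 32, 32])
```

Because the DiT operates on this much smaller latent grid, the number of tokens it attends over shrinks by the product of the temporal and spatial compression factors.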
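The divide-and-merge strategy can be sketched as a sliding window over the time axis with overlapping chunks whose outputs are cross-faded. The chunk length, overlap, linear blending, and the `divide_and_merge` / `process_chunk` names below are hypothetical; the paper merges in latent space with its own blending rule.

```python
# Sketch of a divide-and-merge pass: split a long video into overlapping chunks,
# process each chunk independently, then cross-fade the overlapping frames.
import torch

def divide_and_merge(video, process_chunk, chunk_len=16, overlap=4):
    """video: (B, C, T, H, W). process_chunk must preserve the temporal length of
    its input; overlapping frames from neighbouring chunks are linearly blended."""
    _, _, T, _, _ = video.shape
    out = torch.zeros_like(video)
    acc = torch.zeros(1, 1, T, 1, 1, device=video.device)   # accumulated blend weights
    stride = chunk_len - overlap
    start = 0
    while start < T:
        end = min(start + chunk_len, T)
        processed = process_chunk(video[:, :, start:end])
        t = end - start
        w = torch.ones(t, device=video.device)
        if start > 0:                      # ramp up over the leading overlap region
            ramp = min(overlap, t)
            w[:ramp] = torch.linspace(0.0, 1.0, ramp, device=video.device)
        out[:, :, start:end] += processed * w.view(1, 1, t, 1, 1)
        acc[:, :, start:end] += w.view(1, 1, t, 1, 1)
        if end == T:
            break
        start += stride
    return out / acc                       # normalize by the summed weights

if __name__ == "__main__":
    clip = torch.randn(1, 3, 40, 64, 64)
    # With an identity "encoder" the merged result reproduces the input exactly.
    print(torch.allclose(divide_and_merge(clip, lambda c: c), clip, atol=1e-6))
```

Only one chunk is held in memory at a time, which is what keeps peak memory bounded; the cross-fade over the overlap is one simple way to smooth discontinuities at chunk boundaries.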
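A rough sketch of the factorized spatial/temporal self-attention used in video DiT-style models, together with a standard sinusoidal position encoding, is shown below. Layer sizes and the `SpatioTemporalBlock` name are assumptions, and rotary embeddings (RoPE) are omitted for brevity.

```python
# Sketch of a factorized spatio-temporal transformer block: spatial self-attention
# over the tokens of each frame, then temporal self-attention over each spatial
# position across frames.
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(positions, dim):
    """Classic sin/cos encoding for a 1-D index tensor: shape (N,) -> (N, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = positions.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (B, T, S, D) -- T latent frames, S spatial tokens per frame.
        B, T, S, D = x.shape
        # Spatial attention: fold time into the batch, attend over the S axis.
        h = self.norm1(x).reshape(B * T, S, D)
        x = x + self.spatial_attn(h, h, h, need_weights=False)[0].reshape(B, T, S, D)
        # Temporal attention: fold space into the batch, attend over the T axis.
        h = self.norm2(x).permute(0, 2, 1, 3).reshape(B * S, T, D)
        h = self.temporal_attn(h, h, h, need_weights=False)[0]
        x = x + h.reshape(B, S, T, D).permute(0, 2, 1, 3)
        return x + self.mlp(self.norm3(x))

if __name__ == "__main__":
    tokens = torch.randn(2, 4, 64, 256)                 # B=2, T=4, S=8x8 patches, D=256
    tokens = tokens + sinusoidal_embedding(torch.arange(4), 256)[None, :, None, :]
    print(SpatioTemporalBlock()(tokens).shape)          # torch.Size([2, 4, 64, 256])
```

Factorizing attention this way keeps each attention call over a short sequence (S or T tokens) rather than the full S x T token grid, which is what makes higher resolutions and longer clips tractable.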
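As an illustration of the filtering stages in such a pipeline, here is a minimal per-clip filter driven by OCR coverage, motion, and aesthetic scores. The `Clip` fields, thresholds, and `filter_clips` helper are placeholders; the actual pipeline relies on dedicated models for each of these signals.

```python
# Sketch of a per-clip filtering stage. Scores and thresholds are placeholders;
# in practice they come from OCR, optical-flow, and aesthetic-scoring models.
from dataclasses import dataclass
from typing import List

@dataclass
class Clip:
    path: str
    ocr_text_ratio: float    # fraction of frame area covered by detected text
    motion_score: float      # e.g. mean optical-flow magnitude
    aesthetic_score: float   # e.g. predicted aesthetic rating in [0, 10]

def filter_clips(clips: List[Clip],
                 max_text_ratio: float = 0.05,
                 min_motion: float = 0.5,
                 min_aesthetic: float = 4.5) -> List[Clip]:
    """Keep clips with little burned-in text, enough motion, and decent aesthetics."""
    kept = []
    for clip in clips:
        if clip.ocr_text_ratio > max_text_ratio:
            continue         # drop heavily subtitled or watermarked clips
        if clip.motion_score < min_motion:
            continue         # drop near-static clips
        if clip.aesthetic_score < min_aesthetic:
            continue         # drop low-quality footage
        kept.append(clip)
    return kept
```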
Quantitative and Qualitative Results
Training the VidVAE and DiT models required significant computational resources, amounting to approximately 40 and 642 H100 GPU days, respectively. The xGen-VideoSyn-1 model supports up to 14-second 720p video generation end-to-end. The model's performance was evaluated on multiple metrics, with VBench providing a comprehensive analysis.
The model showcases strong performance across various dimensions:
- Consistency: The generated videos maintain high subject and background consistency over time.
- Temporal Performance: The model achieves smooth motion and minimal temporal flickering, important for realistic video generation.
- Aesthetic Quality: xGen-VideoSyn-1 produces videos with high aesthetic value and image quality, outperforming several baselines.
A user study further validates the model's performance, indicating a preference margin of more than 15% over competing methods such as OpenSora V1.1.
Theoretical and Practical Implications
The research introduces a scalable approach to T2V generation, highlighting the potential of compressed representations in reducing computational demands without compromising video quality. The divide-and-merge strategy and VidVAE's temporal compression represent significant advancements in handling long video sequences efficiently.
From a practical perspective, this model offers robust capabilities for various applications, including content creation, animation, and virtual reality. Its ability to generate diverse styles and high-quality content based on textual descriptions opens new avenues for automated media production.
Future Developments
The paper hints at several future directions:
- Model Scaling: Enhancing the model's architecture and increasing its parameters could further improve video quality and style fidelity.
- Expanded Dataset: Incorporating more diverse and extensive datasets could enrich the model's understanding and generation capabilities.
- Real-time Processing: Optimizing the model for real-time video generation could make it suitable for interactive applications.
In conclusion, "xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations" makes a substantial contribution to the field of T2V generation. By integrating advanced video-specific VAE and diffusion transformer architectures, coupled with a comprehensive data collection pipeline, the model achieves a compelling balance between computational efficiency and video quality, paving the way for future innovations in AI-driven video synthesis.