Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis (2402.14797v1)

Published 22 Feb 2024 in cs.CV and cs.AI

Abstract: Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5 faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. The user studies showed that our model was favored by a large margin over the most recent methods. See our website at https://snap-research.github.io/snapvideo/.

Authors (11)
  1. Willi Menapace
  2. Aliaksandr Siarohin
  3. Ivan Skorokhodov
  4. Ekaterina Deyneka
  5. Tsai-Shien Chen
  6. Anil Kag
  7. Yuwei Fang
  8. Aleksei Stoliar
  9. Elisa Ricci
  10. Jian Ren
  11. Sergey Tulyakov

Summary

  • The paper introduces Snap Video, a video-first model that treats the spatial and temporal dimensions of a video jointly, compressing them into a single 1D latent representation for efficient text-to-video synthesis.
  • It trains 3.31x faster and runs inference roughly 4.5x faster than comparable U-Net architectures, improving scalability and performance.
  • State-of-the-art results on benchmarks like UCF101 and MSR-VTT demonstrate superior photorealism, motion quality, and text alignment in generated videos.

Snap Video: Enhancing Text-to-Video Synthesis with Spatiotemporal Transformers

Introduction

The field of generative AI has seen significant advances, particularly in text-to-image synthesis, where models now produce highly realistic and diverse images. Building on this success, there is growing interest in extending these capabilities to text-to-video synthesis. However, directly transferring architectures and techniques developed for image models to video generation runs into significant challenges stemming from the differences between static images and dynamic video content: handling spatial and temporal redundancy, preserving motion fidelity, and maintaining visual quality, all while keeping computation manageable.

Addressing the Challenges of Video Generation

In response to these challenges, this paper introduces Snap Video, an approach that uses spatiotemporal transformers to efficiently generate high-quality videos from text descriptions. The work adapts the EDM diffusion framework to high-dimensional, highly redundant video inputs and proposes a transformer-based architecture, achieving notable improvements in training and inference time, scalability, and video quality over existing U-Net-based models.
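How the EDM framework is extended is not spelled out in this summary, but the general flavor can be illustrated with standard EDM preconditioning (Karras et al., 2022) plus a hypothetical rescaling of the noise level to account for spatially and temporally redundant pixels. The sketch below is a minimal illustration under those assumptions; the `redundancy_scale` factor, the function names, and the `model` interface are illustrative stand-ins, not the authors' actual formulation.

```python
import torch

def edm_preconditioning(sigma, sigma_data=0.5):
    """Standard EDM scaling coefficients (Karras et al., 2022)."""
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1.0 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = sigma.log() / 4.0
    return c_skip, c_out, c_in, c_noise

def denoise(model, x_noisy, sigma, redundancy_scale=1.0):
    """Wrap a raw network `model` as an EDM denoiser for video tensors.

    `redundancy_scale` is a hypothetical stand-in for the paper's adjustment
    for redundant pixels: as resolution and frame count grow, each pixel
    carries less independent information, so the effective noise level is
    rescaled (the assumed form below is for illustration only).
    """
    sigma = sigma * redundancy_scale
    c_skip, c_out, c_in, c_noise = edm_preconditioning(sigma)
    # Broadcast the per-sample coefficients over (B, C, T, H, W) videos.
    view = (-1,) + (1,) * (x_noisy.ndim - 1)
    f = model(c_in.view(view) * x_noisy, c_noise)
    return c_skip.view(view) * x_noisy + c_out.view(view) * f
```

Whatever the exact reparameterization, the purpose of such a wrapper is that the raw network only ever sees inputs normalized to roughly unit variance, independent of the noise level and of the dimensionality of the video.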

  1. Spatiotemporal Transformers for Video Synthesis: Snap Video treats the spatial and temporal dimensions of a video jointly, compressing them into a single 1D latent representation on which the transformer operates (a rough sketch of this idea follows the list). This joint treatment captures the dynamics of video content more directly, leading to richer motion modeling and better temporal consistency in generated videos.
  2. Performance and Scalability: The proposed architecture trains 3.31 times faster and runs inference roughly 4.5 times faster than a comparable U-Net, which makes it practical to train text-to-video models with billions of parameters.
  3. State-of-the-Art Results: Snap Video achieves state-of-the-art results on benchmarks including UCF101 and MSR-VTT, producing videos with higher visual quality, motion complexity, and temporal consistency. In user studies, its outputs were preferred by a large margin over those of recent methods in photorealism, text-video alignment, and motion quality.
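The first point is the architectural core, so a hypothetical sketch of the general pattern may help: spatial and temporal patches are flattened into one 1D token sequence, and a much shorter sequence of learnable latent tokens carries the expensive self-attention, reading from and writing back to the patch tokens. This is a sketch of the idea only, not the authors' exact architecture; all module names, sizes, and the read/process/write split are assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalLatentBlock(nn.Module):
    """Illustrative block: video patches -> compressed 1D latents -> patches."""

    def __init__(self, dim=512, num_latents=256, patch=(1, 8, 8), channels=3):
        super().__init__()
        pt, ph, pw = patch
        self.patch = patch
        self.to_tokens = nn.Linear(pt * ph * pw * channels, dim)
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.process = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.write = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, video):  # video: (B, C, T, H, W), dims divisible by patch
        b, c, t, h, w = video.shape
        pt, ph, pw = self.patch
        # Flatten space and time jointly into a single 1D patch-token sequence.
        x = video.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
        x = x.permute(0, 2, 4, 6, 3, 5, 7, 1).reshape(b, -1, pt * ph * pw * c)
        tokens = self.to_tokens(x)                    # (B, N_patches, dim)
        # Read: compress the long patch sequence into a few latent tokens.
        z = self.latents.expand(b, -1, -1)
        z, _ = self.read(z, tokens, tokens)
        # Process: self-attention is cheap on the short latent sequence.
        z = self.process(z)
        # Write: distribute the updated latents back to the patch tokens.
        out, _ = self.write(tokens, z, z)
        return out                                    # (B, N_patches, dim)
```

With 1×8×8 patches, a 16-frame 256×256 clip yields 16·32·32 = 16,384 patch tokens, yet full self-attention runs only over the 256 latent tokens; that asymmetry is where the computational savings of this style of design come from relative to dense attention over all patches or a convolutional U-Net.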

Future Directions and Theoretical Implications

The success of Snap Video in addressing the unique challenges of text-to-video synthesis opens up new avenues for research in generative AI. The introduction of spatiotemporal transformers represents a pivotal shift towards more flexible and efficient models capable of handling the complexities of video generation.

  • Exploring Further Applications: The advancements demonstrated by Snap Video can potentially be extended to other areas such as video editing, animation, and even virtual reality, where generating high-quality dynamic content is crucial.
  • Impact on Large-scale Model Training: The efficiencies introduced in the training and inference process also set a precedent for developing even larger models capable of capturing finer nuances in video content.
  • Cross-modal Learning: The performance of Snap Video in maintaining text-to-video alignment highlights the possibilities in cross-modal learning and understanding, which could lead to more cohesive and contextually accurate generative models.

Conclusion

Snap Video marks a significant advancement in the domain of text-to-video synthesis, demonstrating the potential of spatiotemporal transformers in generating high-quality, temporally consistent, and motion-rich videos from textual descriptions. By addressing the inherent limitations of traditional architectures and proposing an efficient, scalable model, this work not only sets new benchmarks in video generation but also lays the groundwork for future innovations in generative AI.