Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis (2402.14797v1)

Published 22 Feb 2024 in cs.CV and cs.AI

Abstract: Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5 faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. The user studies showed that our model was favored by a large margin over the most recent methods. See our website at https://snap-research.github.io/snapvideo/.

Authors (11)
  1. Willi Menapace
  2. Aliaksandr Siarohin
  3. Ivan Skorokhodov
  4. Ekaterina Deyneka
  5. Tsai-Shien Chen
  6. Anil Kag
  7. Yuwei Fang
  8. Aleksei Stoliar
  9. Elisa Ricci
  10. Jian Ren
  11. Sergey Tulyakov

Summary

  • The paper introduces Snap Video, a video-first model that treats the spatial and temporal dimensions of a video jointly, compressing them into a single 1D latent representation for efficient text-to-video synthesis.
  • It trains 3.31x faster and runs inference roughly 4.5x faster than comparable U-Net architectures, improving scalability and performance.
  • State-of-the-art results on benchmarks like UCF101 and MSR-VTT demonstrate superior photorealism, motion quality, and text alignment in generated videos.

Snap Video: Enhancing Text-to-Video Synthesis with Spatiotemporal Transformers

Introduction

The field of generative AI has seen significant advances, particularly in text-to-image synthesis, where models now produce highly realistic and diverse images. Building on this success, there is growing interest in extending these capabilities to text-to-video synthesis. However, directly transferring architectures and techniques developed for image models to video generation runs into significant challenges stemming from the differences between static images and dynamic video content: handling spatial and temporal redundancy, preserving motion fidelity, and maintaining visual quality, all while keeping computation manageable.

Addressing the Challenges of Video Generation

In response to these challenges, this paper introduces Snap Video, an approach that uses spatiotemporal transformers to efficiently generate high-quality videos from text descriptions. The work adapts the EDM diffusion framework to high-dimensional, highly redundant video inputs and proposes a transformer-based architecture, achieving notable improvements in training and inference time, scalability, and video quality over existing U-Net-based models.
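How the EDM framework is extended is not spelled out in this summary, but the general flavor can be illustrated with standard EDM preconditioning (Karras et al., 2022) plus a hypothetical rescaling of the noise level to account for spatially and temporally redundant pixels. The sketch below is a minimal illustration under those assumptions; the `redundancy_scale` factor, the function names, and the `model` interface are illustrative stand-ins, not the authors' actual formulation.

```python
import torch

def edm_preconditioning(sigma, sigma_data=0.5):
    """Standard EDM scaling coefficients (Karras et al., 2022)."""
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1.0 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = sigma.log() / 4.0
    return c_skip, c_out, c_in, c_noise

def denoise(model, x_noisy, sigma, redundancy_scale=1.0):
    """Wrap a raw network `model` as an EDM denoiser for video tensors.

    `redundancy_scale` is a hypothetical stand-in for the paper's adjustment
    for redundant pixels: as resolution and frame count grow, each pixel
    carries less independent information, so the effective noise level is
    rescaled (the assumed form below is for illustration only).
    """
    sigma = sigma * redundancy_scale
    c_skip, c_out, c_in, c_noise = edm_preconditioning(sigma)
    # Broadcast the per-sample coefficients over (B, C, T, H, W) videos.
    view = (-1,) + (1,) * (x_noisy.ndim - 1)
    f = model(c_in.view(view) * x_noisy, c_noise)
    return c_skip.view(view) * x_noisy + c_out.view(view) * f
```

Whatever the exact reparameterization, the purpose of such a wrapper is that the raw network only ever sees inputs normalized to roughly unit variance, independent of the noise level and of the dimensionality of the video.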

  1. Spatiotemporal Transformers for Video Synthesis: Snap Video treats the spatial and temporal dimensions of a video jointly, compressing them into a single 1D latent representation on which the transformer operates (a rough sketch of this idea follows the list). This joint treatment captures the dynamics of video content more directly, leading to richer motion modeling and better temporal consistency in generated videos.
  2. Performance and Scalability: The proposed architecture trains 3.31 times faster and runs inference roughly 4.5 times faster than a comparable U-Net, which makes it practical to train text-to-video models with billions of parameters.
  3. State-of-the-Art Results: Snap Video achieves state-of-the-art results on benchmarks including UCF101 and MSR-VTT, producing videos with higher visual quality, motion complexity, and temporal consistency. In user studies, its outputs were preferred by a large margin over those of recent methods in photorealism, text-video alignment, and motion quality.
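The first point is the architectural core, so a hypothetical sketch of the general pattern may help: spatial and temporal patches are flattened into one 1D token sequence, and a much shorter sequence of learnable latent tokens carries the expensive self-attention, reading from and writing back to the patch tokens. This is a sketch of the idea only, not the authors' exact architecture; all module names, sizes, and the read/process/write split are assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalLatentBlock(nn.Module):
    """Illustrative block: video patches -> compressed 1D latents -> patches."""

    def __init__(self, dim=512, num_latents=256, patch=(1, 8, 8), channels=3):
        super().__init__()
        pt, ph, pw = patch
        self.patch = patch
        self.to_tokens = nn.Linear(pt * ph * pw * channels, dim)
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.process = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.write = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, video):  # video: (B, C, T, H, W), dims divisible by patch
        b, c, t, h, w = video.shape
        pt, ph, pw = self.patch
        # Flatten space and time jointly into a single 1D patch-token sequence.
        x = video.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
        x = x.permute(0, 2, 4, 6, 3, 5, 7, 1).reshape(b, -1, pt * ph * pw * c)
        tokens = self.to_tokens(x)                    # (B, N_patches, dim)
        # Read: compress the long patch sequence into a few latent tokens.
        z = self.latents.expand(b, -1, -1)
        z, _ = self.read(z, tokens, tokens)
        # Process: self-attention is cheap on the short latent sequence.
        z = self.process(z)
        # Write: distribute the updated latents back to the patch tokens.
        out, _ = self.write(tokens, z, z)
        return out                                    # (B, N_patches, dim)
```

With 1×8×8 patches, a 16-frame 256×256 clip yields 16·32·32 = 16,384 patch tokens, yet full self-attention runs only over the 256 latent tokens; that asymmetry is where the computational savings of this style of design come from relative to dense attention over all patches or a convolutional U-Net.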

Future Directions and Theoretical Implications

The success of Snap Video in addressing the unique challenges of text-to-video synthesis opens up new avenues for research in generative AI. The introduction of spatiotemporal transformers represents a pivotal shift towards more flexible and efficient models capable of handling the complexities of video generation.

  • Exploring Further Applications: The advancements demonstrated by Snap Video can potentially be extended to other areas such as video editing, animation, and even virtual reality, where generating high-quality dynamic content is crucial.
  • Impact on Large-scale Model Training: The efficiencies introduced in the training and inference process also set a precedent for developing even larger models capable of capturing finer nuances in video content.
  • Cross-modal Learning: The performance of Snap Video in maintaining text-to-video alignment highlights the possibilities in cross-modal learning and understanding, which could lead to more cohesive and contextually accurate generative models.

Conclusion

Snap Video marks a significant advancement in the domain of text-to-video synthesis, demonstrating the potential of spatiotemporal transformers in generating high-quality, temporally consistent, and motion-rich videos from textual descriptions. By addressing the inherent limitations of traditional architectures and proposing an efficient, scalable model, this work not only sets new benchmarks in video generation but also lays the groundwork for future innovations in generative AI.