Photorealistic Video Generation with Diffusion Models (2312.06662v1)

Published 11 Dec 2023 in cs.CV, cs.AI, and cs.LG

Abstract: We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of $512 \times 896$ resolution at $8$ frames per second.

Introduction to Photorealistic Video Generation

The field of AI-generated content has made significant strides, and a recent contribution demonstrates an advanced method for creating photorealistic videos from textual descriptions. The approach builds on diffusion models, a class of generative models widely used for producing high-quality images, and the resulting model, known as W.A.L.T, uses a transformer-based architecture to extend them to video.

The Mechanics of W.A.L.T

The core of W.A.L.T rests on two pivotal design choices. First, a causal encoder compresses both images and videos into a unified latent space, allowing the model to be trained on the two modalities jointly. Second, a window attention architecture improves memory and training efficiency, which is essential for the demanding task of video generation.
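
To make the window-attention idea concrete, here is a minimal PyTorch sketch of a block that alternates attention within spatial windows (tokens of a single frame) and within spatiotemporal windows (the same spatial patch across frames). The tensor layout, window sizes, and use of nn.MultiheadAttention are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of alternating window attention over a video latent of shape
# (batch, frames, height, width, channels). Window sizes and the attention module
# are illustrative assumptions.
import torch
import torch.nn as nn


def window_attention(x, attn, window):
    """Partition the latent into non-overlapping windows and attend within each window."""
    B, T, H, W, C = x.shape
    wt, wh, ww = window
    # Reshape into (num_windows * B, tokens_per_window, C).
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
    out, _ = attn(x, x, x)  # self-attention restricted to each window
    # Undo the window partitioning.
    out = out.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
    out = out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
    return out


class WindowTransformerBlock(nn.Module):
    """One spatial-window attention layer followed by one spatiotemporal-window layer."""

    def __init__(self, dim=256, heads=8, frames=8, spatial=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.st_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_window = (1, spatial, spatial)   # attend within a single frame
        self.st_window = (frames, spatial, spatial)   # attend across frames in a patch

    def forward(self, x):
        x = x + window_attention(x, self.spatial_attn, self.spatial_window)
        x = x + window_attention(x, self.st_attn, self.st_window)
        return x


if __name__ == "__main__":
    latents = torch.randn(2, 8, 16, 16, 256)  # toy latent video
    block = WindowTransformerBlock()
    print(block(latents).shape)  # torch.Size([2, 8, 16, 16, 256])
```

Restricting attention to fixed-size windows keeps the cost quadratic only within each window rather than in the full token count, which is what makes joint spatial and spatiotemporal modeling tractable at video scale.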

The text-to-video pipeline combines three trained models. Generation begins with a base latent video diffusion model, whose output is then refined by two video super-resolution diffusion models. These stages upscale the content to the final 512 × 896 resolution at 8 frames per second, achieving impressive detail and temporal consistency in the resulting videos.
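
Below is a minimal sketch (not the released pipeline) of how such a cascade can be wired at inference time. The Stage abstraction, intermediate resolutions, and frame count are illustrative assumptions; only the final 512×896 output at 8 fps comes from the paper.

```python
# A toy three-stage cascade: a base latent video diffusion model followed by two
# video super-resolution diffusion models. All names and numbers except the final
# 512x896, 8 fps target are illustrative.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class Stage:
    name: str
    out_height: int
    out_width: int

    def sample(self, prompt: str, conditioning: Optional[torch.Tensor]) -> torch.Tensor:
        # Placeholder for running this stage's diffusion sampler, conditioned on the
        # text prompt and (for the super-resolution stages) the previous stage's video.
        frames, channels = 16, 3
        return torch.zeros(frames, channels, self.out_height, self.out_width)


def generate_video(prompt: str) -> torch.Tensor:
    base = Stage("base latent video diffusion", 128, 224)
    sr1 = Stage("first video super-resolution", 256, 448)
    sr2 = Stage("second video super-resolution", 512, 896)

    video = base.sample(prompt, conditioning=None)    # low-resolution draft
    video = sr1.sample(prompt, conditioning=video)    # upscale, conditioned on the draft
    video = sr2.sample(prompt, conditioning=video)    # final 512x896 frames at 8 fps
    return video


print(generate_video("a golden retriever surfing a wave").shape)
```

Cascading lets each model work at a manageable resolution, with every super-resolution stage conditioned on the previous stage's output, instead of asking a single model to generate 512×896 frames in one shot.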

Performance and Training Efficiency

In benchmark tests, W.A.L.T reaches state-of-the-art results on class-conditional video generation (UCF-101 and Kinetics-600) and also performs strongly on image generation (ImageNet). Notably, these results are obtained without classifier-free guidance, signaling efficiency at sampling time as well as quality.
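
For context, classifier-free guidance mixes a conditional and an unconditional prediction at every sampling step, which doubles the number of denoiser forward passes. The sketch below (with illustrative names and signature, not W.A.L.T's interface) shows the standard formulation that the paper reports it can do without on these benchmarks.

```python
# A minimal sketch of classifier-free guidance; `model` and its arguments are
# illustrative assumptions.
def guided_noise(model, x_t, t, text_emb, null_emb, guidance_scale=5.0):
    """Standard classifier-free guidance: two denoiser calls per sampling step."""
    eps_cond = model(x_t, t, text_emb)      # prediction with the text condition
    eps_uncond = model(x_t, t, null_emb)    # prediction with a "null" condition
    # Push the sample toward the conditional prediction.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```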

W.A.L.T's configurations trade off the number of model parameters against the quality of the generated video, showing how deliberate resource allocation within the architecture can balance fidelity against computational load.

Innovating Beyond Still Images

While still imagery has seen considerable progress in generative modeling, video synthesis has lagged. The release of W.A.L.T is a notable push forward, demonstrating that high-resolution, temporally coherent videos can be generated effectively from textual descriptions. This opens avenues for a range of applications, from content creation to potential uses in virtual reality, simulations, and more.

W.A.L.T stands out for being jointly trained on image and video datasets, which lets it capitalize on the vast amount of image data available, in contrast to the comparatively scarce video data. This joint training significantly benefits the model's performance, contributing to more detailed and accurate video outputs.
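
A minimal sketch of the underlying idea, assuming images are treated as one-frame videos so a single temporally causal encoder serves both modalities, is shown below. `CausalEncoder3D` and its layer choices are hypothetical stand-ins, not the paper's tokenizer.

```python
# Toy causal tokenizer: spatial downsampling plus a temporally causal convolution,
# so an image (T=1) maps to the same kind of latent as the first frame of a video.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalEncoder3D(nn.Module):
    def __init__(self, in_ch=3, latent_ch=8):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, latent_ch, kernel_size=(2, 4, 4), stride=(1, 4, 4))

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        # Pad one frame of zeros on the past side only, so no latent depends on future frames.
        x = F.pad(x, (0, 0, 0, 0, 1, 0))
        return self.conv(x)


encoder = CausalEncoder3D()
video = torch.randn(2, 3, 9, 64, 64)   # a short clip
image = torch.randn(2, 3, 1, 64, 64)   # an image viewed as a one-frame video
print(encoder(video).shape)            # torch.Size([2, 8, 9, 16, 16])
print(encoder(image).shape)            # torch.Size([2, 8, 1, 16, 16])
```

Because padding is applied only on the past side, the latent for the first frame never depends on later frames, so a standalone image and the first frame of a video are encoded identically and the two data sources can be mixed freely during training.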

Future Paths and Conclusion

W.A.L.T's success underscores the potential of scaling up a unified framework for image and video generation to close the gap between the two. The model's efficiency and output quality point toward a new phase of AI-driven content generation, in which the boundaries of creativity and automation continue to expand, potentially transforming how visual media is produced and consumed.

Authors (9)
  1. Agrim Gupta
  2. Lijun Yu
  3. Kihyuk Sohn
  4. Xiuye Gu
  5. Meera Hahn
  6. Li Fei-Fei
  7. Irfan Essa
  8. Lu Jiang
  9. José Lezama