
VideoPoet: A Large Language Model for Zero-Shot Video Generation (2312.14125v4)

Published 21 Dec 2023 in cs.CV and cs.AI

Abstract: We present VideoPoet, an LLM capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of LLMs, consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/

An Overview of VideoPoet: An LLM for Zero-Shot Video Generation

The paper introduces VideoPoet, an approach that leverages LLMs to accomplish a diverse set of video generation tasks conditioned on multimodal inputs, including image, video, text, and audio. The authors propose a framework that bridges LLMs and video generation, an area predominantly led by diffusion models.

Architectural Innovation

VideoPoet employs a decoder-only transformer architecture, a common choice in LLMs, to process and generate videos. The setup includes three core components (a minimal code sketch of the token pipeline follows the list):

  1. Tokenizers: Modality-specific tokenizers convert inputs into discrete tokens. The MAGVIT-v2 tokenizer is used for images and videos, while the SoundStream tokenizer manages audio inputs. This unified vocabulary allows the model to directly process different modalities.
  2. LLM Backbone: A prefix LLM forms the central component where task-specific prefixes guide the video generation process.
  3. Super-Resolution Module: This component refines the fidelity of generated video outputs, increasing spatial resolution and improving detail coherence.
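
To make the flow concrete, below is a minimal sketch of the unified-token pipeline described above: a task prefix of discrete tokens conditions a decoder-only transformer that autoregressively emits video tokens in the same shared vocabulary, which the video tokenizer would then decode back to pixels. All names, sizes, and the toy vocabulary are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch (not the released implementation) of the unified token interface.
import torch
import torch.nn as nn

VOCAB_SIZE = 4096   # toy size; the real shared text/visual/audio vocabulary is far larger
D_MODEL = 512

class DecoderOnlyLM(nn.Module):
    """Prefix LLM backbone: causal self-attention over the unified token stream."""
    def __init__(self, vocab_size: int, d_model: int, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)                              # next-token logits over the shared vocabulary

def generate_video_tokens(model, prefix_tokens, n_video_tokens):
    """Autoregressively extend a task prefix (text/image/audio tokens) with video tokens."""
    seq = prefix_tokens
    for _ in range(n_video_tokens):
        logits = model(seq)[:, -1]                       # logits at the last position
        next_tok = torch.multinomial(torch.softmax(logits, -1), num_samples=1)
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, prefix_tokens.size(1):]                # newly generated video tokens

# Usage: a text prompt, already tokenized to discrete ids, serves as the prefix;
# in the real system the output ids would be decoded to frames by the MAGVIT-v2 tokenizer.
model = DecoderOnlyLM(VOCAB_SIZE, D_MODEL)
text_prefix = torch.randint(0, VOCAB_SIZE, (1, 16))      # stand-in for a tokenized prompt
video_tokens = generate_video_tokens(model, text_prefix, n_video_tokens=32)
```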

Pretraining and Task Adaptation

The training regime for VideoPoet involves two principal stages: pretraining on large-scale data with a mixture of multimodal generative objectives (for example, text-to-video, image-to-video, and video future prediction), followed by task-specific adaptation. Because every modality is represented in the same discrete token space, the single pretrained model serves as a foundation that can be prompted or fine-tuned for a range of downstream video generation tasks; a sketch of one mixed-objective training step appears below.
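
As a rough illustration of what such a mixed-objective pretraining step might look like, the sketch below samples a task from an assumed mixture, concatenates the task's conditioning prefix with its target tokens, and applies next-token cross-entropy only on the target span. The task names, weights, and data format are placeholders, not the paper's exact recipe.

```python
# Hedged sketch of a mixed-objective pretraining step; names and weights are illustrative.
import random
import torch
import torch.nn.functional as F

TASK_MIXTURE = {                 # assumed example weights, not the paper's mixture
    "text_to_video": 0.4,
    "frame_prediction": 0.3,
    "image_to_video": 0.2,
    "video_to_audio": 0.1,
}

def sample_task():
    tasks, weights = zip(*TASK_MIXTURE.items())
    return random.choices(tasks, weights=weights, k=1)[0]

def pretraining_step(model, example, optimizer):
    """One step: `example` maps task name -> (prefix_ids, target_ids), each of shape (1, L)."""
    prefix, target = example[sample_task()]
    seq = torch.cat([prefix, target], dim=1)
    logits = model(seq[:, :-1])                          # predict every next token
    labels = seq[:, 1:].clone()
    labels[:, : prefix.size(1) - 1] = -100               # no loss on the conditioning prefix
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=-100)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```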

Experimental Demonstrations

The model showcases its versatility across various tasks such as text-to-video (T2V), image-to-video (I2V), video future prediction, and video stylization. Notably, the model demonstrates:

  • Coherent Long-Video Generation: By iteratively generating new video segments conditioned on its previous outputs, VideoPoet can extend content well beyond short clips (a sketch of this loop follows the list).
  • Zero-Shot Video Editing and Task Chaining: By chaining tasks, feeding the output of one capability (such as stylization or editing) back in as conditioning for another, the model performs combinations it was not explicitly pretrained for.
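
The long-video loop can be pictured as follows. This is a hedged sketch that reuses the hypothetical generate_video_tokens helper from the earlier snippet; the overlap length and helper names are assumptions for illustration rather than details taken from the paper.

```python
# Illustrative iterative-extension loop: each new segment is generated conditioned on
# tokens from the tail of the previous segment, so clips chain into a longer video.
import torch

def generate_long_video(model, text_prefix, n_segments, tokens_per_segment, overlap_tokens=8):
    """Autoregressively extend a video by re-conditioning on the tail of the last segment."""
    all_tokens = []
    context = text_prefix                                # first segment: text-only conditioning
    for _ in range(n_segments):
        segment = generate_video_tokens(model, context, tokens_per_segment)
        all_tokens.append(segment)
        # The next segment sees the text prompt plus the most recent video tokens,
        # which is what lets motion continue across segment boundaries.
        context = torch.cat([text_prefix, segment[:, -overlap_tokens:]], dim=1)
    return torch.cat(all_tokens, dim=1)                  # decode with the video tokenizer afterwards
```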

Comparative Evaluation

Performance evaluations reflect VideoPoet's competitive edge against existing methods on standard benchmarks such as MSR-VTT and UCF-101 without task-specific fine-tuning. Human evaluations further favor it for generating interesting and realistic motion compared with state-of-the-art diffusion models, although it still trails some of them on visual sharpness and text fidelity, which the authors attribute to training data choices.

Implications and Future Directions

VideoPoet positions LLMs as viable alternatives to diffusion models for multimodal video generation. By leveraging a unified token framework, it offers flexibility and scalability in handling diverse video tasks. Notable future directions include enhancing fine-grained detail in video outputs and addressing representational biases observed in the training datasets. Such a model could also be adapted to settings that demand intricate narrative construction and cross-domain video synthesis, widening its application scope.

In conclusion, VideoPoet exemplifies the adaptability of LLMs to video content generation, combining state-of-the-art fidelity with multi-task versatility and establishing a promising foundation for future developments in video-driven applications and AI systems.

References (86)
  1. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
  2. Alternating gradient descent and mixture-of-experts for integrated multimodal perception. arXiv preprint arXiv:2305.06324, 2023.
  3. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  4. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a.
  5. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, pages 22563–22575, 2023b.
  6. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  7. Instructpix2pix: Learning to follow image editing instructions. In CVPR, pages 18392–18402, 2023.
  8. Language models are few-shot learners. NeurIPS, 33:1877–1901, 2020.
  9. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
  10. Pix2video: Video editing using image diffusion. In CVPR, pages 23206–23217, 2023.
  11. Stablevideo: Text-driven consistency-aware diffusion video editing. In CVPR, pages 23040–23050, 2023.
  12. Maskgit: Masked generative image transformer. In CVPR, pages 11315–11325, 2022.
  13. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
  14. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023a.
  15. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023b.
  16. Better may not be fairer: A study on subgroup discrepancy in image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4956–4966, 2023.
  17. PaLM: Scaling language modeling with pathways. arXiv:2204.02311, 2022.
  18. Flashattention: Fast and memory-efficient exact attention with io-awareness. In NeurIPS, 2022.
  19. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
  20. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  21. GLaM: Efficient scaling of language models with mixture-of-experts. In ICML, 2022.
  22. Taming transformers for high-resolution image synthesis. In CVPR, pages 12868–12878, 2020.
  23. Structure and content-guided video synthesis with diffusion models. In CVPR, pages 7346–7356, 2023.
  24. Ccedit: Creative and controllable video editing via diffusion models. arXiv preprint arXiv:2309.16496, 2023.
  25. Preserve your own correlation: A noise prior for video diffusion models. In CVPR, pages 22930–22941, 2023.
  26. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
  27. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017.
  28. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  29. Maskvit: Masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894, 2022.
  30. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2023.
  31. Cnn architectures for large-scale audio classification. In ICASSP, 2017.
  32. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  33. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
  34. Video diffusion models. arXiv:2204.03458, 2022b.
  35. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  36. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
  37. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.
  38. StarCoder: may the source be with you! arXiv:2305.06161, 2023.
  39. Magicedit: High-fidelity and temporally coherent video editing. arXiv preprint arXiv:2308.14749, 2023.
  40. Evalcrafter: Benchmarking and evaluating large video generation models. arXiv preprint arXiv:2310.11440, 2023.
  41. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  42. Transframer: Arbitrary frame prediction with generative models. arXiv preprint arXiv:2203.09494, 2022.
  43. OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023.
  44. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
  45. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  46. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  47. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.
  48. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  49. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE TPAMI, 44(3):1623–1637, 2020.
  50. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  51. Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925, 2023.
  52. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 35:36479–36494, 2022.
  53. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan. IJCV, 128(10):2586–2606, 2020.
  54. A step toward more inclusive people annotations for fairness. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 916–925, 2021.
  55. Consensus and subjectivity of skin tone annotation for ML fairness. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  56. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  57. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  58. Disentangling architecture and training for optical flow. In ECCV, 2022.
  59. Any-to-any generation via composable diffusion. arXiv preprint arXiv:2305.11846, 2023.
  60. Ul2: Unifying language learning paradigms. In ICLR, 2022.
  61. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  62. Maxvit: Multi-axis vision transformer. In ECCV, pages 459–479, 2022.
  63. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
  64. Attention is all you need. NeurIPS, 30, 2017.
  65. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399, 2022.
  66. Mcvd-masked conditional video diffusion for prediction, generation, and interpolation. NeurIPS, 35:23371–23385, 2022.
  67. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a.
  68. Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599, 2023b.
  69. Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation. arXiv preprint arXiv:2305.10874, 2023c.
  70. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023d.
  71. Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806, 2021.
  72. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016.
  73. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
  74. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
  75. Magvit: Masked generative video transformer. In CVPR, pages 10459–10469, 2023a.
  76. Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms. arXiv preprint arXiv:2306.17842, 2023b.
  77. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023c.
  78. Video probabilistic diffusion models in projected latent space. In CVPR, pages 18456–18466, 2023d.
  79. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
  80. Make pixels dance: High-dynamic video generation. arXiv preprint arXiv:2311.10982, 2023.
  81. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023a.
  82. Adding conditional control to text-to-image diffusion models. In CVPR, pages 3836–3847, 2023b.
  83. Auditing gender presentation differences in text-to-image models. arXiv preprint arXiv:2302.03675, 2023c.
  84. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.
  85. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
  86. RT-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023.
Authors (31)
  1. Dan Kondratyuk (11 papers)
  2. Lijun Yu (22 papers)
  3. Xiuye Gu (17 papers)
  4. José Lezama (19 papers)
  5. Jonathan Huang (46 papers)
  6. Rachel Hornung (4 papers)
  7. Hartwig Adam (49 papers)
  8. Hassan Akbari (8 papers)
  9. Yair Alon (3 papers)
  10. Vighnesh Birodkar (16 papers)
  11. Yong Cheng (58 papers)
  12. Ming-Chang Chiu (11 papers)
  13. Josh Dillon (3 papers)
  14. Irfan Essa (91 papers)
  15. Agrim Gupta (26 papers)
  16. Meera Hahn (15 papers)
  17. Anja Hauth (6 papers)
  18. David Hendon (2 papers)
  19. Alonso Martinez (2 papers)
  20. David Minnen (19 papers)
Citations (147)