VideoPoet: A Large Language Model for Zero-Shot Video Generation (2312.14125v4)
Abstract: We present VideoPoet, an LLM capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of LLMs, consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
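The abstract's central design, a single decoder-only transformer that consumes text, visual, and audio tokens in one stream and is trained with next-token prediction, can be illustrated with a minimal sketch. Everything below is an illustrative assumption rather than the released implementation: the vocabulary sizes, ID offsets, special tokens, `make_sequence`, and `TinyDecoderLM` are hypothetical stand-ins. The actual model tokenizes video with a MAGVIT-v2 tokenizer and audio with SoundStream into discrete IDs that share one vocabulary with text tokens.

```python
# Minimal sketch (not the released VideoPoet code) of a decoder-only LM
# treating text, visual, and audio tokens as one sequence. All names and
# sizes here are illustrative assumptions.
import torch
import torch.nn as nn

# Hypothetical per-modality vocabulary sizes, mapped into one shared ID space
# by offsetting each modality's token IDs.
VOCAB_SIZES = {"text": 32000, "visual": 262144, "audio": 4096}
offsets, total = {}, 0
for name, size in VOCAB_SIZES.items():
    offsets[name] = total
    total += size
BOS, EOS = total, total + 1          # assumed special tokens
VOCAB = total + 2

def make_sequence(text_ids, visual_ids, audio_ids):
    """Concatenate modality streams into one token sequence for
    next-token prediction over the shared vocabulary."""
    parts = [torch.tensor([BOS])]
    for name, ids in (("text", text_ids),
                      ("visual", visual_ids),
                      ("audio", audio_ids)):
        parts.append(torch.as_tensor(ids) + offsets[name])
    parts.append(torch.tensor([EOS]))
    return torch.cat(parts)

class TinyDecoderLM(nn.Module):
    """A toy decoder-only transformer; VideoPoet itself is far larger."""
    def __init__(self, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):
        x = self.embed(ids)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.blocks(x, mask=mask))

seq = make_sequence([5, 17], [1001, 1002, 1003], [42]).unsqueeze(0)
logits = TinyDecoderLM()(seq)        # next-token logits over the shared vocab
```

Because every modality lives in the same ID space, the same autoregressive objective covers text-to-video, video-to-audio, and the other conditioning setups the abstract mentions; only the composition of the input sequence changes per task.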