UniVG: Towards UNIfied-modal Video Generation (2401.09084v1)
Abstract: Diffusion-based video generation has received extensive attention and achieved considerable success in both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input image and text conditions flexibly, either individually or in combination. To address this, we propose a Unified-modal Video Generation system that is capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom, and classify them into high-freedom and low-freedom video generation categories. For high-freedom video generation, we employ Multi-condition Cross Attention to generate videos that align with the semantics of the input images or text. For low-freedom video generation, we introduce Biased Gaussian Noise to replace the purely random Gaussian noise, which helps to better preserve the content of the input conditions. Our method achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses current open-source methods in human evaluations, and is on par with the current closed-source method Gen2. For more samples, visit https://univg-baidu.github.io.
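The abstract names two mechanisms without giving details. The following minimal PyTorch sketch illustrates one plausible reading of each, under stated assumptions: Multi-condition Cross Attention is read as cross-attention whose keys/values come from the concatenation of text and image condition embeddings, and Biased Gaussian Noise is read as an SDEdit-style initialization that forward-diffuses the condition latent instead of sampling pure noise. All names here (MultiConditionCrossAttention, biased_gaussian_noise, cond_latent, alpha_bar_t) are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the two mechanisms named in the abstract; not the authors' code.
import torch
import torch.nn as nn


class MultiConditionCrossAttention(nn.Module):
    """Cross-attention whose keys/values are the concatenation of text and
    image condition embeddings, so either or both conditions may be supplied."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, text_emb=None, img_emb=None):
        # Concatenate whichever conditions are present along the token axis.
        conds = [c for c in (text_emb, img_emb) if c is not None]
        assert conds, "at least one of text_emb / img_emb is required"
        ctx = torch.cat(conds, dim=1)                  # (B, L_text + L_img, dim)
        out, _ = self.attn(query=x, key=ctx, value=ctx)
        return x + out                                 # residual connection


def biased_gaussian_noise(cond_latent: torch.Tensor,
                          alpha_bar_t: float) -> torch.Tensor:
    """Replace the pure-noise starting point with a forward-diffused version of
    the condition latent (DDPM forward process, SDEdit-style), so sampling
    starts near the condition's content rather than from random noise."""
    eps = torch.randn_like(cond_latent)                # pure Gaussian component
    return alpha_bar_t ** 0.5 * cond_latent + (1.0 - alpha_bar_t) ** 0.5 * eps
```

In this reading, high-freedom tasks (text/image-to-video) use the cross-attention path, while low-freedom tasks (e.g., animating a given frame) start denoising from the biased initialization so the input content is better preserved.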
Authors: Ludan Ruan, Lei Tian, Chuanwei Huang, Xu Zhang, Xinyan Xiao