UniVG: Towards UNIfied-modal Video Generation (2401.09084v1)

Published 17 Jan 2024 in cs.CV

Abstract: Diffusion-based video generation has received extensive attention and achieved considerable success within both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input images and text conditions in a flexible manner, either individually or in combination. To address this, we propose a Unified-modal Video Generation system that is capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom, and classify them into high-freedom and low-freedom video generation categories. For high-freedom video generation, we employ Multi-condition Cross Attention to generate videos that align with the semantics of the input images or text. For low-freedom video generation, we introduce Biased Gaussian Noise to replace the pure random Gaussian Noise, which helps to better preserve the content of the input conditions. Our method achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses the current open-source methods in human evaluations, and is on par with the current closed-source method Gen2. For more samples, visit https://univg-baidu.github.io.

Authors (5)
  1. Ludan Ruan
  2. Lei Tian
  3. Chuanwei Huang
  4. Xu Zhang
  5. Xinyan Xiao

Summary

  • The paper introduces UniVG, a novel approach that supports multi-task video generation by unifying text and image conditions under high- and low-freedom settings.
  • It employs a Multi-condition Cross Attention module for high-freedom tasks and Biased Gaussian Noise for low-freedom tasks to maintain precise adherence to input conditions.
  • Quantitative measures and human evaluations demonstrate UniVG's strong performance and frame consistency, rivaling leading systems like Gen2.

Overview of Unified-modal Video Generation

The pursuit of sophisticated video generation systems has led to remarkable advancements, particularly with diffusion-based generative models. Current systems primarily handle singular objectives, such as text-to-video or image-to-video generation. This limited scope fails to serve users who need flexibility in how they specify conditions: they may supply only text, only an image, or both together. To resolve this, the authors introduce Unified-modal Video Generation (UniVG), which supports multi-task video creation by accommodating a diverse range of input conditions across both text and image modalities.

Categorizing Video Generation Tasks

The core innovation of UniVG lies in its categorization of video generation tasks into high-freedom and low-freedom categories according to their degree of generative freedom. High-freedom tasks come with loosely defined input conditions, granting the model a broad canvas on which to render videos. In contrast, low-freedom tasks operate within strict constraints, often at the pixel level, and demand precise adherence to the input conditions.

For high-freedom tasks, UniVG employs a Multi-condition Cross Attention module that aligns the generated video with the semantics of the input text and images while leaving the model substantial creative latitude. For low-freedom tasks, it replaces pure random Gaussian noise with Biased Gaussian Noise, which anchors the denoising process to the content of the input conditions and thus better preserves them during generation.
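
To make these two mechanisms concrete, here is a minimal PyTorch sketch. It is an illustration under assumptions, not the paper's implementation: the module names, dimensions, and the exact bias schedule are invented for clarity. The cross-attention simply lets video tokens attend to concatenated text and image embeddings, and the biased noise is obtained by forward-diffusing the condition latent instead of sampling pure N(0, I).

```python
import torch
import torch.nn as nn


class MultiConditionCrossAttention(nn.Module):
    """Illustrative cross-attention over concatenated text and image
    embeddings (hypothetical layer, not the paper's exact module)."""

    def __init__(self, dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            dim, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )

    def forward(self, video_tokens, text_emb, image_emb):
        # Concatenate both condition streams so a single attention pass can
        # draw on either modality, or both, depending on what the user gave.
        cond = torch.cat([text_emb, image_emb], dim=1)
        out, _ = self.attn(query=video_tokens, key=cond, value=cond)
        return video_tokens + out  # residual connection


def biased_gaussian_noise(cond_latent, t, alphas_cumprod):
    """Assumed form of the biased starting noise for low-freedom tasks:
    forward-diffuse the condition latent to step t so denoising starts from
    noise whose mean is shifted toward the content to be preserved."""
    a_t = alphas_cumprod[t]
    noise = torch.randn_like(cond_latent)
    return a_t.sqrt() * cond_latent + (1.0 - a_t).sqrt() * noise
```

In this reading, low-freedom tasks such as image animation or super-resolution would begin sampling from biased_gaussian_noise rather than from pure noise, while high-freedom text- or image-to-video generation would rely mainly on the cross-attention path.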

Advancements in System Performance

UniVG achieves the lowest FVD on the MSR-VTT benchmark among the compared methods, surpasses current open-source methods in human evaluations, and is on par with Gen2, a leading closed-source system. It accommodates flexible conditioning, ranging from text- or image-driven generation to tighter tasks such as image animation and super-resolution, and it is particularly strong in frame consistency, which contributes to coherent, visually appealing videos.
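
Frame consistency is commonly approximated as the mean CLIP similarity between consecutive frames. The helper below sketches that proxy; it is an assumption for illustration, not necessarily the protocol used in the paper, and the model name is only a placeholder.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection


def clip_frame_consistency(frames, model_name="openai/clip-vit-base-patch32"):
    """Mean cosine similarity of CLIP image embeddings between consecutive
    frames; `frames` is assumed to be a list of PIL images from one video."""
    processor = CLIPImageProcessor.from_pretrained(model_name)
    model = CLIPVisionModelWithProjection.from_pretrained(model_name).eval()
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        embeds = model(**inputs).image_embeds  # (num_frames, dim)
    embeds = F.normalize(embeds, dim=-1)
    return (embeds[:-1] * embeds[1:]).sum(dim=-1).mean().item()
```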

Moreover, the UniVG framework allows the influence of the text and image inputs to be scaled independently, yielding a spectrum of videos that ranges from predominantly text-driven to closely image-aligned, a testament to the system's adaptability.
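
One plausible way to realize such independent scaling is classifier-free guidance with a separate scale for each condition, as used in other multi-condition diffusion models. The function below is a sketch of that idea under this assumption; the paper may combine conditions differently.

```python
def multi_condition_guidance(eps_uncond, eps_text, eps_image,
                             w_text: float, w_image: float):
    """Blend noise predictions so text and image influence can be tuned
    independently. eps_uncond, eps_text, and eps_image are the denoiser's
    outputs with no condition, text only, and image only, respectively."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_image * (eps_image - eps_uncond))
```

Raising w_text while lowering w_image yields mostly text-driven results, and the reverse keeps the output close to the reference image.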

Future Directions and Conclusion

This approach not only unifies several video generation tasks within a single system but also suggests potential applications to other constrained generation tasks. While the current iteration already generates high-quality videos that align well with their conditions, future work may enhance the dynamics of the generated motion and extend the method to related domains. UniVG offers a promising toolkit for video generation across a wide array of input conditions, broadening the horizon for both creators and AI in video production.
