
Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions (2401.01827v1)

Published 3 Jan 2024 in cs.CV

Abstract: Most existing video diffusion models (VDMs) are limited to text-only conditions and therefore usually lack control over the visual appearance and geometry structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatial-temporal layers for representing video features and a decoupled cross-attention layer that addresses image and text inputs for appearance conditioning. In addition, we carefully design the model architecture so that it can optionally integrate with pre-trained image ControlNet modules for geometric visual conditions, without extra training overhead, as opposed to prior methods. Experiments show that with versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvements in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on https://github.com/salesforce/LAVIS.

Summary

  • The paper introduces MoonShot, a novel system that enables precise control over both visual appearance and geometry in video generation using multimodal inputs.
  • It employs a multimodal video block with decoupled cross-attention to process image and text simultaneously, enhancing generation fidelity and reducing retraining needs.
  • Empirical results demonstrate that MoonShot significantly improves visual quality and temporal consistency, outperforming traditional text-to-video diffusion models.

Overview of MoonShot

MoonShot is a video generation and editing system that offers control over both the visual appearance and the geometric structure of generated videos. Its core component is the multimodal video block (MVB), which pairs conventional spatial-temporal layers for representing video features with a decoupled cross-attention layer. The cross-attention layer processes image and text inputs simultaneously, so generation can be guided jointly by visual references and textual descriptions.
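
To make the decoupled cross-attention concrete, the sketch below shows one way such a layer could be organized in PyTorch: separate key/value projections attend to text tokens and image tokens, and the two attention outputs are combined additively. The class name, dimensions, and the additive combination are illustrative assumptions, not the paper's released implementation.

```python
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Minimal sketch of decoupled cross-attention: one attention branch for
    text tokens, a separate branch for image tokens, outputs summed.
    Names and dimensions are illustrative assumptions."""

    def __init__(self, dim: int, text_dim: int, image_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(
            dim, num_heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.attn_image = nn.MultiheadAttention(
            dim, num_heads, kdim=image_dim, vdim=image_dim, batch_first=True)

    def forward(self, video_tokens, text_tokens, image_tokens, image_scale: float = 1.0):
        # video_tokens: (B, N, dim)      flattened spatial features
        # text_tokens:  (B, L_t, text_dim)  text encoder outputs
        # image_tokens: (B, L_i, image_dim) image encoder outputs
        out_text, _ = self.attn_text(video_tokens, text_tokens, text_tokens)
        out_image, _ = self.attn_image(video_tokens, image_tokens, image_tokens)
        # Appearance conditioning: the image branch is added on top of the text branch.
        return video_tokens + out_text + image_scale * out_image
```

The `image_scale` knob here is hypothetical, but a scalar of this kind is a common way to trade off how strongly the image reference steers appearance relative to the text prompt.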

The Advent of Multimodal Control

In contrast to conventional text-to-video diffusion models that rely on text alone, MoonShot conditions on both image and text inputs through a carefully designed architecture. This dual-input strategy addresses a key limitation of text-only models, which often lack the precision needed to generate specific visual content. Conditioning on images enriches appearance detail, reduces the need for repeated per-subject fine-tuning, and opens a path to zero-shot subject-customized video generation. MoonShot can also integrate pre-trained image ControlNet modules to control video geometry without any additional training, in contrast to prior methods that require extra adaptation.
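
The sketch below illustrates the general idea behind reusing an image ControlNet on video without retraining: the frame axis is folded into the batch axis so each frame's depth or edge map is processed exactly as a single image would be, and the resulting residuals are later added to the matching spatial blocks of the video UNet. The function name and the `controlnet` call signature are assumptions for illustration; they do not reproduce MoonShot's released code.

```python
import torch

def apply_image_controlnet_per_frame(controlnet, latents: torch.Tensor,
                                     geometry_maps: torch.Tensor,
                                     timestep, text_embeds: torch.Tensor):
    """Hypothetical sketch: reuse a pre-trained *image* ControlNet on a video
    by treating frames as a batch of images. `controlnet` is assumed to be a
    callable returning one residual tensor per spatial block of the UNet."""
    b, f, c, h, w = latents.shape
    flat_latents = latents.reshape(b * f, c, h, w)                 # frames as images
    flat_maps = geometry_maps.reshape(b * f, *geometry_maps.shape[2:])
    flat_text = text_embeds.repeat_interleave(f, dim=0)            # same prompt per frame
    residuals = controlnet(flat_latents, timestep, flat_text, flat_maps)
    # Each residual is later added to the matching spatial block of the video
    # UNet; no ControlNet weights are retrained for video.
    return [r.reshape(b, f, *r.shape[1:]) for r in residuals]
```

Because the video model keeps its spatial feature distribution close to that of the image model, these per-frame residuals remain meaningful when injected into the video UNet.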

Applications and Performance

MoonShot adapts to a variety of applications, from personalized video generation to image animation and video editing, without extensive retraining. Empirical results show improved visual quality and temporal consistency, and stronger performance than existing models on controlled generation tasks. Geometry inputs such as depth or edge maps extend control over the structural aspects of the generated videos, and when conditioned on a still frame the model produces consistent, high-quality results that are competitive with leading foundation video diffusion models.

Architectural Nuances and Future Implications

The architecture of MoonShot is central to its flexibility. By keeping the spatial-temporal layers separate from the cross-attention layers, the model preserves the spatial feature distribution that pre-trained ControlNet modules expect, which is what allows them to be attached without retraining. Space-time attention in the temporal layers promotes temporal consistency and smooth motion, while the decoupled multimodal cross-attention layers let the model respond to both text and image inputs, keeping the generated video aligned with its conditioning cues.
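
As an illustration of the temporal half of space-time attention, the sketch below reshapes the video feature tensor so that each spatial location attends across frames; the residual connection leaves the spatial features themselves intact, consistent with the goal of keeping them compatible with ControlNet. The module name, normalization choice, and tensor layout are assumptions for illustration, not the paper's exact layer.

```python
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Minimal sketch of temporal attention: every spatial location attends
    across the frame axis, encouraging temporally consistent features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, F, N, C) -- batch, frames, spatial tokens, channels
        b, f, n, c = x.shape
        tokens = x.permute(0, 2, 1, 3).reshape(b * n, f, c)  # attend along frames
        h = self.norm(tokens)
        h, _ = self.attn(h, h, h)
        tokens = tokens + h                                   # residual keeps spatial content
        return tokens.reshape(b, n, f, c).permute(0, 2, 1, 3)
```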

In short, MoonShot represents a significant advance in controllable video generation, setting a new standard for conditioning on images, text, and geometry together. With the model slated for public release, it is well positioned to serve as a foundation architecture for personalized and controllable video creation and editing.
