Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions (2401.01827v1)
Abstract: Most existing video diffusion models (VDMs) are limited to text-only conditions and therefore offer little control over the visual appearance and geometric structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model is built upon a core module, called the multimodal video block (MVB), which consists of conventional spatial-temporal layers for representing video features and a decoupled cross-attention layer that attends to image and text inputs for appearance conditioning. In addition, we carefully design the model architecture so that it can optionally integrate with pre-trained image ControlNet modules for geometric visual conditions, without the extra training overhead required by prior methods. Experiments show that, with its versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvements in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation, and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public at https://github.com/salesforce/LAVIS.
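The decoupled cross-attention described in the abstract can be pictured as two attention branches that share the same video queries: one attends to text features, the other to image features, and their outputs are combined for appearance conditioning. Below is a minimal PyTorch sketch of that idea; the class name, the `image_scale` weighting, and the simple summation rule are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch of a decoupled cross-attention layer: the video latent attends
# to text and image features through separate key/value projections, and the
# two results are summed. Names and the combination rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledCrossAttention(nn.Module):
    def __init__(self, query_dim: int, text_dim: int, image_dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.head_dim = query_dim // heads
        self.to_q = nn.Linear(query_dim, query_dim, bias=False)
        # Separate key/value projections for the text and image conditions.
        self.to_k_text = nn.Linear(text_dim, query_dim, bias=False)
        self.to_v_text = nn.Linear(text_dim, query_dim, bias=False)
        self.to_k_img = nn.Linear(image_dim, query_dim, bias=False)
        self.to_v_img = nn.Linear(image_dim, query_dim, bias=False)
        self.to_out = nn.Linear(query_dim, query_dim)

    def _attend(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Split heads to (batch, heads, tokens, head_dim) and run scaled dot-product attention.
        b = q.shape[0]
        q, k, v = (t.view(b, -1, self.heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, -1, self.heads * self.head_dim)

    def forward(
        self,
        hidden_states: torch.Tensor,   # video latent tokens, (batch, n_tokens, query_dim)
        text_embeds: torch.Tensor,     # text condition features, (batch, n_text, text_dim)
        image_embeds: torch.Tensor,    # image condition features, (batch, n_img, image_dim)
        image_scale: float = 1.0,      # assumed knob trading off image vs. text influence
    ) -> torch.Tensor:
        q = self.to_q(hidden_states)
        # Text and image branches attend independently to the same queries.
        text_out = self._attend(q, self.to_k_text(text_embeds), self.to_v_text(text_embeds))
        img_out = self._attend(q, self.to_k_img(image_embeds), self.to_v_img(image_embeds))
        # Combine both appearance conditions before the output projection.
        return self.to_out(text_out + image_scale * img_out)
```

Keeping the text and image branches as separate key/value projections (rather than concatenating the two condition streams) is what allows the image branch to be added to a pre-trained text-conditioned block without disturbing its original text cross-attention weights.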