MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation (2311.16635v1)
Abstract: Zero-shot Text-to-Video synthesis generates videos from prompts without using any video data. Without motion information from real videos, the motion priors implied in prompts become vital guidance. For example, the prompt "airplane landing on the runway" implies the motion prior that the "airplane" moves downwards while the "runway" stays static. However, previous approaches do not fully exploit these motion priors, which leads to two nontrivial issues: 1) the motion variation pattern remains fixed and prompt-agnostic because motion priors are disregarded; 2) the motion control of different objects is inaccurate and entangled because the independent motion priors of different objects are not considered. To tackle these two issues, we propose a prompt-adaptive and disentangled motion control strategy, coined MotionZero, which derives the motion priors of different objects from prompts via Large Language Models and accordingly applies disentangled motion control to the regions corresponding to each object. Furthermore, to support videos with varying degrees of motion amplitude, we propose a Motion-Aware Attention scheme which adjusts attention among frames according to motion amplitude. Extensive experiments demonstrate that our strategy correctly controls the motion of different objects and supports versatile applications, including zero-shot video editing.
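The abstract outlines three mechanisms: per-object motion priors derived from the prompt by an LLM, disentangled per-region motion control, and attention weighting by motion amplitude. The sketch below shows how these pieces could fit together in principle; it is a minimal illustration, not the authors' implementation, and every name, data structure, and the weighting rule (`MotionPrior`, `derive_motion_priors`, `motion_aware_weight`, the hard-coded priors for the running example) is an assumption made for exposition.

```python
# Minimal sketch of the three ideas in the abstract (illustrative only):
# (1) per-object motion priors extracted from the prompt,
# (2) disentangled per-object region motion,
# (3) a cross-frame attention weight that decays with motion amplitude.

from dataclasses import dataclass

@dataclass
class MotionPrior:
    obj: str          # object mentioned in the prompt, e.g. "airplane"
    dx: float         # assumed horizontal shift per frame (latent units)
    dy: float         # assumed vertical shift per frame (negative = down)
    amplitude: float  # 0.0 means the object stays static

def derive_motion_priors(prompt: str) -> list[MotionPrior]:
    """Stand-in for the LLM query: in the real system an LLM would parse
    the prompt; here the paper's running example is hard-coded."""
    if "airplane landing" in prompt:
        return [
            MotionPrior("airplane", dx=0.2, dy=-0.5, amplitude=0.5),
            MotionPrior("runway",   dx=0.0, dy=0.0,  amplitude=0.0),
        ]
    return []

def object_box_at_frame(p: MotionPrior, box: tuple, t: int) -> tuple:
    """Disentangled control: each object's region moves independently
    according to its own prior; static objects (amplitude 0) never move."""
    x0, y0, x1, y1 = box
    return (x0 + p.dx * t, y0 + p.dy * t, x1 + p.dx * t, y1 + p.dy * t)

def motion_aware_weight(priors: list[MotionPrior]) -> float:
    """Assumed weighting rule: larger motion amplitude means each frame
    attends less rigidly to previous frames, freeing content to move."""
    amp = max((p.amplitude for p in priors), default=0.0)
    return 1.0 / (1.0 + amp)

priors = derive_motion_priors("airplane landing on the runway")
for p in priors:
    print(f"{p.obj}: per-frame shift ({p.dx}, {p.dy}), amplitude {p.amplitude}")
print("frame-3 airplane box:", object_box_at_frame(priors[0], (10, 60, 40, 80), 3))
print("cross-frame attention weight:", motion_aware_weight(priors))
```

In the actual method the priors would come from an LLM rather than a hard-coded table, and the weight would modulate cross-frame attention inside the diffusion model rather than being printed; the sketch only fixes the data flow.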