MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation

Published 28 Nov 2023 in cs.CV (2311.16635v1)

Abstract: Zero-shot text-to-video synthesis generates videos from prompts without any video data. In the absence of motion information from videos, the motion priors implied by prompts become vital guidance. For example, the prompt "airplane landing on the runway" implies the motion prior that the "airplane" moves downwards while the "runway" stays static. Previous approaches do not fully exploit these motion priors, which leads to two nontrivial issues: 1) the motion variation pattern remains unaltered and prompt-agnostic because motion priors are disregarded; 2) the motion control of different objects is inaccurate and entangled because the independent motion priors of different objects are not considered. To tackle these two issues, we propose a prompt-adaptive and disentangled motion control strategy coined MotionZero, which derives the motion priors of different objects from prompts via Large Language Models and accordingly applies disentangled motion control to the corresponding region of each object. Furthermore, to support videos with varying degrees of motion amplitude, we propose a Motion-Aware Attention scheme that adjusts attention among frames according to motion amplitude. Extensive experiments demonstrate that our strategy correctly controls the motion of different objects and supports versatile applications including zero-shot video editing.

References (68)
  1. Latent-shift: Latent diffusion with temporal shift for efficient text-to-video generation. arXiv preprint arXiv:2304.08477, 2023.
  2. Conditional gan with discriminative filter generation for text-to-video synthesis. In IJCAI, 2019.
  3. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023.
  4. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023.
  5. Language models are few-shot learners. NeurIPS, 2020.
  6. Segment and track anything. arXiv preprint arXiv:2305.06558, 2023.
  7. Cogview2: Faster and better text-to-image generation via hierarchical transformers. In NeurIPS, 2022.
  8. Diffsynth: Latent in-iteration deflickering for realistic video synthesis. arXiv preprint arXiv:2308.03463, 2023.
  9. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011, 2023.
  10. Empowering dynamics-aware text-to-video diffusion with large language models. arXiv preprint arXiv:2308.13812, 2023.
  11. Layoutgpt: Compositional visual planning and generation with large language models. arXiv preprint arXiv:2305.15393, 2023.
  12. Preserve your own correlation: A noise prior for video diffusion models. In ICCV, 2023.
  13. Animate-a-story: Storytelling with retrieval-augmented video generation. arXiv preprint arXiv:2307.06940, 2023.
  14. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022.
  15. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  16. Denoising diffusion probabilistic models. NeurIPS, 2020.
  17. Video diffusion models. In NeurIPS, 2022.
  18. Large language models are frame-level directors for zero-shot text-to-video generation. arXiv preprint arXiv:2305.14330, 2023.
  19. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In ICLR, 2023.
  20. Free-bloom: Zero-shot text-to-video generator with llm director and ldm animator. arXiv preprint arXiv:2309.14494, 2023.
  21. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In ICML, 2022.
  22. Text2performer: Text-driven human video generation. arXiv preprint arXiv:2304.08483, 2023.
  23. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
  24. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  25. Videogen: A reference-guided latent diffusion approach for high definition text-to-video generation. arXiv preprint arXiv:2309.00398, 2023.
  26. Video generation from text. In AAAI, 2018.
  27. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
  28. Llm-grounded video diffusion models. arXiv preprint arXiv:2309.17444, 2023.
  29. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning. arXiv preprint arXiv:2309.15091, 2023.
  30. Dual-stream diffusion net for text-to-video generation. arXiv preprint arXiv:2308.08316, 2023.
  31. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  32. Video-p2p: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761, 2023.
  33. Multimodal procedural planning via dual text-image prompting. arXiv preprint arXiv:2305.01795, 2023.
  34. Follow your pose: Pose-guided text-to-video generation using pose-free videos. arXiv preprint arXiv:2304.01186, 2023.
  35. Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329, 2023.
  36. GPT-4 technical report. OpenAI, 2023.
  37. Training language models to follow instructions with human feedback. NeurIPS, 2022.
  38. To create what you tell: Generating videos from captions. In ACM MM, 2017.
  39. Grounded text-to-image synthesis with attention refocusing. arXiv preprint arXiv:2306.05427, 2023.
  40. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023.
  41. Dancing avatar: Pose and text-guided human motion videos synthesis with image diffusion model. arXiv preprint arXiv:2308.07749, 2023.
  42. Learning transferable visual models from natural language supervision. In ICML, 2021.
  43. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  44. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  45. Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation. In CVPR, 2023.
  46. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
  47. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580, 2023.
  48. Make-a-video: Text-to-video generation without text-video data. In ICLR, 2023.
  49. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  50. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, 2023.
  51. Phenaki: Variable length video generation from open domain textual descriptions. In ICLR, 2023.
  52. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
  53. Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation. arXiv preprint arXiv:2305.10874, 2023.
  54. Videocomposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018, 2023.
  55. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023.
  56. Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806, 2021.
  57. Nüwa: Visual synthesis pre-training for neural visual world creation. In ECCV, 2022.
  58. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, 2023.
  59. Make-your-video: Customized video generation using textual and structural guidance. arXiv preprint arXiv:2306.00943, 2023.
  60. Simda: Simple diffusion adapter for efficient video generation. arXiv preprint arXiv:2308.09710, 2023.
  61. Probabilistic adaptation of text-to-video models. arXiv preprint arXiv:2306.01872, 2023.
  62. Nuwa-xl: Diffusion over diffusion for extremely long video generation. arXiv preprint arXiv:2303.12346, 2023.
  63. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
  64. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
  65. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023.
  66. Least-to-most prompting enables complex reasoning in large language models. In ICLR, 2022.
  67. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
  68. Moviefactory: Automatic movie creation from text using large generative models for language and images. In ACM MM BNI, 2023.

Summary

  • The paper introduces a framework using LLMs to extract object-level motion priors for improved zero-shot text-to-video synthesis.
  • It employs disentangled motion control via region-specific warping, achieving superior semantic and temporal alignment.
  • Experimental results show significant gains in textual alignment and motion correctness, validating its effectiveness in video generation.

MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation

Introduction

The paper "MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation" (2311.16635) addresses the intrinsic limitations of zero-shot text-to-video (T2V) generation pipelines that lack explicit modeling of object-centric motion priors implied by textual prompts. Previous approaches in zero-shot T2V either treat motion globally or rely on prompt decomposition into per-frame instructions, resulting in text-agnostic and often entangled motion trajectories, poor object control, and diminished semantic coherence across frames. MotionZero introduces a framework for prompt-adaptive and disentangled motion control, leveraging LLMs to extract object-level motion priors and constraining the generated video by these priors without the need for video training data.

Zero-shot T2V operates on pretrained text-to-image (T2I) diffusion models (e.g., Stable Diffusion), repurposing them for video synthesis by controlling temporal progression across frames. Works such as Text2Video-Zero and DirecT2V integrate LLM-generated per-frame specifications or interpolate prompt sequences, but generally fail to accurately reflect complex, text-implied motion patterns. These strategies either induce a static, prompt-agnostic motion template (Text2Video-Zero) or encounter spatial and temporal incoherence due to lack of object-level disentanglement (DirecT2V, Free-Bloom).

MotionZero directly targets these gaps by using prompt and image understanding to extract explicit per-object motion priors and execute region-specific, disentangled warping in the latent space. This aligns video content more precisely with prompt semantics and allows multiple objects to be independently controlled.

Methodology

Extracting Motion Priors

MotionZero applies a two-stage process to extract and utilize motion priors:

  1. LLM-based Motion Reasoning: Given a prompt, an LLM (e.g., GPT-4) is queried to decompose the scene into moving and static objects, as well as infer the canonical moving direction for each object. Explicit directionality is determined, partitioning object motion into discrete bins (e.g., up, down, left, right, and diagonals).
  2. Disambiguating Non-directional Actions with Vision: For prompts with ambiguous verbs (e.g., “walking” or “riding”), the initial frame is synthesized using DDIM sampling from the T2I model. A Visual Question Answering (VQA) module infers likely movement direction from visual context, resolving ambiguities not explicit in text.

The resultant per-object motion plan is {Object: [Direction sequence], ...}, governing how each object should move per frame.
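
To make the first stage concrete, here is a minimal sketch of LLM-based motion reasoning, assuming an OpenAI-style chat API; the prompt template, model choice, and JSON schema are illustrative assumptions, not the paper's exact setup.

```python
import json
from openai import OpenAI  # assumes the openai Python package (>= 1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DIRECTIONS = ["up", "down", "left", "right",
              "up-left", "up-right", "down-left", "down-right", "static"]

def extract_motion_priors(prompt: str, num_frames: int = 8) -> dict:
    """Query an LLM for a per-object motion plan: {object: [direction per frame]}.

    The instruction below is a hypothetical template, not the paper's wording;
    it assumes the model returns valid JSON.
    """
    system = (
        "Decompose the scene into moving and static objects and assign each "
        f"object one direction per frame, chosen from {DIRECTIONS}. "
        'Answer with JSON only: {"object": ["direction", ...], ...}.'
    )
    user = f'Prompt: "{prompt}". The video has {num_frames} frames.'
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return json.loads(resp.choices[0].message.content)

# e.g. extract_motion_priors("airplane landing on the runway")
# -> {"airplane": ["down", "down", ...], "runway": ["static", ...]}
```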

Disentangled Motion Control

Rather than warping the entire image globally, MotionZero leverages advanced open-vocabulary segmentation (SAM/Grounded-SAM) to obtain precise spatial masks for each controllable object in the initial frame. Motion control is then executed as follows:

  • Spatial Warping: For each video frame, object-specific feature regions (as specified by masks) are warped according to the corresponding motion prior; a minimal warp sketch follows this list. Warp operations are reversed for the background to avoid producing visible artifacts and to ensure proper occlusion/inpainting.
  • Feature Fusion and Stochasticity: The disentangled, motion-controlled features are further processed via additional diffusion steps, ensuring high visual fidelity, stochastic variation across frames, and avoidance of hard “cut-and-paste” artifacts. The background is explicitly modeled as the “(c+1)-th character” with no motion apart from camera-induced or evolving context.
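
To picture the region-specific warp, here is a minimal PyTorch sketch that shifts one object's masked latent region by a single direction step; the step size in latent pixels, the hard paste, and leaving the vacated region to later diffusion steps are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

# Illustrative direction -> (dy, dx) offsets in latent pixels; the step
# size is an assumption, not a value from the paper.
OFFSETS = {"up": (-4, 0), "down": (4, 0), "left": (0, -4),
           "right": (0, 4), "static": (0, 0)}

def warp_object(latent: torch.Tensor, mask: torch.Tensor, direction: str) -> torch.Tensor:
    """Shift the masked region of a latent map by one direction step.

    latent: (C, H, W) diffusion latent; mask: (H, W) binary object mask.
    """
    dy, dx = OFFSETS[direction]
    shifted = torch.roll(latent, shifts=(dy, dx), dims=(1, 2))
    shifted_mask = torch.roll(mask, shifts=(dy, dx), dims=(0, 1))
    # Paste the shifted object over the original frame; the vacated region
    # keeps the old content and is cleaned up by the subsequent diffusion
    # steps (the fusion stage described above).
    return torch.where(shifted_mask.bool().unsqueeze(0), shifted, latent)
```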

Motion-Aware Cross-frame Attention

Static cross-frame attention strategies, as in earlier zero-shot T2V works, are insufficient to handle diverse motion amplitudes. MotionZero introduces a Motion-Aware Attention (MAA) mechanism: the anchor frame for query-key-value attention is dynamically updated based on the IoU between the current and anchor object masks. Once the spatial configuration deviates significantly (tuned by a threshold γ), the anchor frame is replaced, enabling high-mobility scenes to attend to recent frames and static scenes to anchor to the beginning for semantic consistency. This mechanism prevents issues such as object distortion and missing regions caused by inappropriate temporal reference selection.
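
A minimal sketch of this anchor-update rule, assuming binary per-frame object masks; the IoU test and the γ default follow the description above, everything else is illustrative.

```python
import torch

def update_anchor(anchor_idx: int, frame_idx: int,
                  masks: list[torch.Tensor], gamma: float = 0.6) -> int:
    """Return the anchor-frame index for cross-frame attention.

    masks[i] is the binary object mask of frame i. When the current mask
    overlaps the anchor mask by less than gamma (IoU), the anchor is
    replaced with the current frame.
    """
    a, m = masks[anchor_idx].bool(), masks[frame_idx].bool()
    inter = (a & m).sum().item()
    union = (a | m).sum().item()
    iou = inter / union if union > 0 else 1.0
    return frame_idx if iou < gamma else anchor_idx
```

In a high-mobility scene the IoU drops quickly and the anchor tracks recent frames; in a static scene the IoU stays above γ, so attention keeps referring back to the first frame.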

Applications

The granularity and modularity of MotionZero’s motion reasoning and control enable:

  • Foreground/Background Semantic Editing: Independent semantic control enables zero-shot editing of either foreground or background while maintaining plausible interaction and visual harmony.
  • Skeleton-based Body Control: LLM-inferred pseudo-skeleton trajectories, derived from text, can drive motion of articulated bodies (without explicit skeleton video supervision), exploiting ControlNet integration for fine-grained pose manipulation.
  • Camera Motion Modeling: Relative motion between foreground and background can be used to mimic camera movements by applying inverse transforms to non-actor regions (see the sketch after this list).
  • Multi-stage/Evolving Event Video Synthesis: Complex prompts expressing event unfolding or scene transitions can be parsed into per-stage/slice motion plans, supporting temporally segmented video composition under LLM guidance.
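
As a rough illustration of the camera-motion idea (the shift step and mask convention are assumptions, not the paper's implementation): a rightward camera pan is approximated by shifting the background region left while the actor stays fixed.

```python
import torch

# Inverse pan direction -> (dy, dx) latent-pixel offsets (illustrative).
INVERSE = {"left": (0, 4), "right": (0, -4), "up": (4, 0), "down": (-4, 0)}

def pan_camera(latent: torch.Tensor, fg_mask: torch.Tensor, direction: str) -> torch.Tensor:
    """Mimic a camera move: warp the background latents opposite to the pan
    direction while the foreground actor region is left untouched.

    latent: (C, H, W) diffusion latent; fg_mask: (H, W) actor mask in [0, 1].
    """
    dy, dx = INVERSE[direction]
    shifted = torch.roll(latent, shifts=(dy, dx), dims=(1, 2))
    bg = (fg_mask < 0.5).unsqueeze(0)        # background = non-actor region
    return torch.where(bg, shifted, latent)  # move background, keep actor
```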

Experimental Evaluation

Qualitative and Quantitative Results

MotionZero is compared to CogVideo, DirecT2V, and Text2Video-Zero across textual alignment (CLIP similarity; a rough sketch of this metric follows the results below), motion correctness (trajectory tracking versus ground-truth LLM-extracted motion), and user preference (human ratings on semantic consistency and visual quality). The framework demonstrates:

  • Top CLIP Score: Achieves the highest frame-textual alignment, outperforming models trained with large-scale video data.
  • Dominant Motion Correctness: Surpasses CogVideo, DirecT2V, and Text2Video-Zero by substantial margins in trajectory accuracy, e.g., 82.86% (Ours) vs. 55.71% (CogVideo).
  • Clear Preference in Human Evaluation: MotionZero is favored by users for both textual alignment and video quality, especially for scenes with multiple, independently moving objects or articulated actions.
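
The frame-text alignment metric can be computed roughly as follows, assuming the openai/clip-vit-base-patch32 checkpoint from Hugging Face; the paper's exact CLIP variant and averaging protocol may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(frames: list[Image.Image], prompt: str) -> float:
    """Average cosine similarity between the prompt and each video frame."""
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```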

Ablation Studies

Removing LLM-extracted motion priors produces prompt-agnostic, erroneous trajectories; removing disentangled region control yields globally distorted or entangled motion; disabling MAA destabilizes object and background coherence, leading to object fusion or disappearance. Varying the IoU threshold for MAA reveals an optimal balance (γ = 0.6), with too frequent anchor updates undermining appearance consistency and too infrequent updates causing drift.

Implications and Future Directions

By rigorously exploiting motion priors at the object and prompt level, MotionZero demonstrates a substantial leap in zero-shot video controllability and coherence without any additional video supervision or fine-tuning. This paradigm enables dynamic manipulation and editing of generated content, unifying visual grounding and linguistic reasoning. Future explorations should examine scaling to longer and more complex prompt sequences, incorporating granular 3D/trajectory-level motion priors, and extending cross-modal interaction (e.g., audio-driven/scene-aware video generation). Deeper integration with universal segmentation and grounding frameworks, as well as more advanced VQA pipelines, may further reduce errors in ambiguous settings or heavily cluttered scenes.

Conclusion

MotionZero provides a prompt-adaptive, disentangled framework for zero-shot T2V generation, robustly extracting and exploiting motion priors via LLMs and vision models. The method consistently outperforms prior works in prompt-aligned motion realization, multi-entity disentanglement, and editing flexibility. MotionZero sets a new reference for semantic-controlled, training-free T2V synthesis and opens new pathways for scalable, semantically compositional video generation in alignment with open-world textual descriptions (2311.16635).
