MotionMaster: Training-free Camera Motion Transfer For Video Generation (2404.15789v2)

Published 24 Apr 2024 in cs.CV

Abstract: The emergence of diffusion models has greatly propelled the progress in image and video generation. Recently, some efforts have been made in controllable video generation, including text-to-video generation and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module, and necessitate substantial computational resources due to the large amount of parameters in video generation models. Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model, which disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method to extract camera motion from a single source video, which separates the moving objects from the background and estimates the camera motion in the moving-object regions from the motion in the background by solving a Poisson equation. Furthermore, we propose a few-shot camera motion disentanglement method to extract the common camera motion from multiple videos with similar camera motions, which employs a window-based clustering technique to extract the common features in the temporal attention maps of multiple videos. Finally, we propose a motion combination method to combine different types of camera motions, enabling more controllable and flexible camera control. Extensive experiments demonstrate that our training-free approach can effectively decouple camera and object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control.

MotionMaster: A Training-free Approach for Flexible Camera Motion Transfer in Video Generation

Introduction

MotionMaster introduces a training-free method for transferring camera motion from a source video to newly generated videos, with no fine-tuning of the underlying video generation model. Its central mechanism disentangles camera motion from object motion, enabling precise camera control. This addresses the limitations of prior approaches, which train dedicated temporal camera modules at substantial computational cost and support only camera motion types pre-defined during training, lacking the flexibility needed for complex maneuvers such as those used in professional film production.

Methodology Overview

MotionMaster extracts and manipulates camera motion through three training-free components:

  1. One-shot Camera Motion Disentanglement:
    • Uses a single source video, separating the moving objects from the background.
    • Estimates camera motion in the background regions and completes it inside the moving-object regions by solving a Poisson equation (a toy sketch of this completion step follows the list).
  2. Few-shot Camera Motion Disentanglement:
    • Extracts the common camera motion from multiple videos that share similar camera movements.
    • Employs window-based clustering of the videos' temporal attention maps to isolate camera motion from object motion (see the clustering sketch after this list).
  3. Camera Motion Combination:
    • The disentangled camera motions can be combined and applied region by region, substantially increasing the flexibility of camera control during generation (a region-masked blending sketch appears after the experiments paragraph below).
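As a rough illustration of the completion step in item 1, the sketch below fills in a camera-motion field under a moving-object mask by solving a discrete Laplace/Poisson problem whose boundary values come from the surrounding background motion. The dense flow field, the object mask, and the simple Jacobi solver are illustrative assumptions; the paper performs this disentanglement on temporal attention maps inside the video diffusion model rather than on raw optical flow.

```python
import numpy as np

def complete_camera_motion(motion, object_mask, n_iters=500):
    """Fill the camera-motion field inside `object_mask` by iteratively solving a
    discrete Laplace equation, using the surrounding background motion as fixed
    boundary values (Jacobi relaxation).

    motion      : (H, W, 2) float array of per-pixel motion; background pixels are
                  assumed to reflect pure camera motion.
    object_mask : (H, W) bool array, True where moving objects hide the camera
                  motion and the field must be reconstructed.
    """
    h, w = object_mask.shape
    filled = motion.astype(float).copy()
    ys, xs = np.nonzero(object_mask)
    for _ in range(n_iters):
        # Jacobi step: every masked pixel becomes the mean of its four neighbours.
        up    = filled[np.maximum(ys - 1, 0), xs]
        down  = filled[np.minimum(ys + 1, h - 1), xs]
        left  = filled[ys, np.maximum(xs - 1, 0)]
        right = filled[ys, np.minimum(xs + 1, w - 1)]
        filled[ys, xs] = 0.25 * (up + down + left + right)
    return filled

# Toy usage: a uniform rightward pan with a square "object" hole in the middle.
H, W = 64, 64
motion = np.tile(np.array([1.0, 0.0]), (H, W, 1))   # camera pans right everywhere
mask = np.zeros((H, W), dtype=bool)
mask[24:40, 24:40] = True                            # region occluded by a moving object
motion[mask] = 0.0                                   # unknown inside the object region
reconstructed = complete_camera_motion(motion, mask)
print(np.allclose(reconstructed[mask], [1.0, 0.0], atol=1e-2))  # True: pan propagated into the hole
```

With a uniform pan as the background motion, the filled-in region converges to that same pan; the camera motion is propagated into the area the object occluded.

For the few-shot case in item 2, a hedged sketch of the window-based clustering idea: for each spatial window, collect one motion feature per video (e.g. a flattened patch of the temporal attention map), cluster the features, and keep the centroid of the largest cluster as the shared camera motion, discarding videos whose window is dominated by object motion. The feature shape, the choice of DBSCAN, and its parameters are assumptions made for this example, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def common_window_motion(window_feats, eps=0.5, min_samples=2):
    """Estimate the camera-motion feature shared by several videos for one window.

    window_feats : (n_videos, d) array with one motion feature per video for this
                   spatial window (e.g. a flattened temporal-attention patch).
    Returns the centroid of the largest DBSCAN cluster; videos whose window is
    dominated by object motion tend to land outside it and are ignored.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(window_feats)
    valid = labels[labels != -1]                    # -1 marks outliers
    if valid.size == 0:                             # no consensus: fall back to the mean
        return window_feats.mean(axis=0)
    largest = np.bincount(valid).argmax()           # most populated cluster
    return window_feats[labels == largest].mean(axis=0)

# Toy usage: 5 videos, 4 share a (noisy) camera motion, 1 is dominated by an object.
rng = np.random.default_rng(0)
shared = rng.normal(size=8)
feats = np.stack([shared + 0.05 * rng.normal(size=8) for _ in range(4)]
                 + [shared + 5.0 * rng.normal(size=8)])
estimate = common_window_motion(feats)
print(np.linalg.norm(estimate - shared) < 0.2)      # True: the outlier video is rejected
```

Clustering is done independently per window, so each window can reject a different subset of videos as outliers.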

Experimentation and Results

Extensive experiments demonstrate MotionMaster's ability to apply extracted camera motions effectively across various scenarios. Notably, the model supports advanced camera maneuvers like Dolly Zoom and variable-speed zooming, which are directly transferable to new video content.
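To make the combination step (item 3 above) concrete, the sketch below blends two camera-motion representations either globally with a scalar weight or region by region with a mask; composing opposing motions in different regions is one plausible way to picture Dolly-Zoom-like effects. The array shapes, weights, and mask are illustrative assumptions; in the paper, the quantities being combined are temporal attention maps.

```python
import numpy as np

def combine_camera_motions(motion_a, motion_b, weight_a=0.5, region_mask=None):
    """Blend two camera-motion representations of identical shape.

    Without a mask this is a global linear combination. With a boolean
    `region_mask` (broadcastable to the motion shape), motion_a drives the
    masked region and motion_b the rest (e.g. opposing zooms inside and
    outside a subject region for a Dolly-Zoom-like effect).
    """
    if region_mask is None:
        return weight_a * motion_a + (1.0 - weight_a) * motion_b
    return np.where(region_mask, motion_a, motion_b)

# Toy usage on (frames, H, W) motion maps.
pan  = np.ones((8, 32, 32))                                             # constant pan strength
zoom = np.linspace(0.0, 1.0, 8)[:, None, None] * np.ones((8, 32, 32))  # zoom ramps up over time
blended = combine_camera_motions(pan, zoom, weight_a=0.7)               # 70% pan, 30% zoom everywhere
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True                                                 # subject region at frame centre
regional = combine_camera_motions(zoom, pan, region_mask=mask)          # zoom on the subject, pan elsewhere
print(blended.shape, regional.shape)                                    # (8, 32, 32) (8, 32, 32)
```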

  • Comparative Analysis:
    • MotionMaster outperforms existing methods like AnimateDiff and MotionCtrl in terms of video quality and fidelity to the camera motion patterns of the source material.
    • It achieves superior results on complex camera-motion scenarios with significantly lower computational overhead, owing to its training-free design.
  • Quantitative Metrics:
    • Performance is reported using standard video-generation metrics such as FVD and FID-V, indicating high-quality output and accurate replication of the source camera motion (a sketch of the underlying Fréchet-distance computation follows).
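To make the metric concrete: FVD is the Fréchet distance between Gaussians fitted to features of real and of generated videos, with features taken from a pretrained video network (typically I3D). The sketch below computes that distance from precomputed feature matrices; the feature extractor, feature dimensionality, and sample counts are assumptions and not part of the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two sets of video features.

    feats_real, feats_gen : (n_samples, d) arrays, e.g. embeddings from a
    pretrained video network (I3D for FVD). Lower means the generated
    distribution is closer to the real one.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):        # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage: identical feature sets give (numerically) zero distance.
rng = np.random.default_rng(0)
a = rng.normal(size=(256, 16))
b = rng.normal(size=(256, 16))
print(frechet_distance(a, a) < 1e-6, frechet_distance(a, b) > 0.0)  # True True
```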

Future Research Directions

Directions for further research include more granular disentanglement techniques that allow even finer control over the interaction between object and camera motions. Integrating the approach with larger-scale video generation frameworks could also pave the way for real-time video production tools in virtual reality and interactive media.

Conclusion

MotionMaster sets a new benchmark for flexible, efficient camera motion control in video generation. By eliminating the need for retraining and effectively decoupling camera and object motions, it offers a scalable solution adaptable to various professional video production needs. This approach could significantly impact how camera motions are managed in automatic video generation, leading to more creative and dynamic visual content.

References (49)
  1. J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
  2. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695.
  3. Y. Guo, C. Yang, A. Rao, Y. Wang, Y. Qiao, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725, 2023.
  4. A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts et al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint arXiv:2311.15127, 2023.
  5. U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni et al., “Make-a-video: Text-to-video generation without text-video data,” arXiv preprint arXiv:2209.14792, 2022.
  6. H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan, “Videocrafter2: Overcoming data limitations for high-quality video diffusion models,” arXiv preprint arXiv:2401.09047, 2024.
  7. H. Chen, M. Xia, Y. He, Y. Zhang, X. Cun, S. Yang, J. Xing, Y. Liu, Q. Chen, X. Wang et al., “Videocrafter1: Open diffusion models for high-quality video generation,” arXiv preprint arXiv:2310.19512, 2023.
  8. T.-S. Chen, C. H. Lin, H.-Y. Tseng, T.-Y. Lin, and M.-H. Yang, “Motion-conditioned diffusion model for controllable video synthesis,” arXiv preprint arXiv:2304.14404, 2023.
  9. S. Tu, Q. Dai, Z.-Q. Cheng, H. Hu, X. Han, Z. Wu, and Y.-G. Jiang, “Motioneditor: Editing video motion via content-aware diffusion,” arXiv preprint arXiv:2311.18830, 2023.
  10. C. Chen, J. Shu, L. Chen, G. He, C. Wang, and Y. Li, “Motion-zero: Zero-shot moving object control framework for diffusion-based video generation,” arXiv preprint arXiv:2401.10150, 2024.
  11. S. Yang, L. Hou, H. Huang, C. Ma, P. Wan, D. Zhang, X. Chen, and J. Liao, “Direct-a-video: Customized video generation with user-directed camera movement and object motion,” arXiv preprint arXiv:2402.03162, 2024.
  12. J. Bai, T. He, Y. Wang, J. Guo, H. Hu, Z. Liu, and J. Bian, “Uniedit: A unified tuning-free framework for video motion and appearance editing,” arXiv preprint arXiv:2402.13185, 2024.
  13. C. Qi, X. Cun, Y. Zhang, C. Lei, X. Wang, Y. Shan, and Q. Chen, “Fatezero: Fusing attentions for zero-shot text-based video editing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15932–15942.
  14. J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7623–7633.
  15. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
  16. Z. Wang, Z. Yuan, X. Wang, T. Chen, M. Xia, P. Luo, and Y. Shan, “Motionctrl: A unified and flexible motion controller for video generation,” arXiv preprint arXiv:2312.03641, 2023.
  17. C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” Advances in neural information processing systems, vol. 29, 2016.
  18. T.-C. Wang, M.-Y. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro, “Few-shot video-to-video synthesis,” arXiv preprint arXiv:1910.12713, 2019.
  19. M. Saito, E. Matsumoto, and S. Saito, “Temporal generative adversarial nets with singular value clipping,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2830–2839.
  20. J. Zhang, C. Xu, L. Liu, M. Wang, X. Wu, Y. Liu, and Y. Jiang, “Dtvnet: Dynamic time-lapse video generation via single still image,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V. Springer, 2020, pp. 300–315.
  21. R. Girdhar, M. Singh, A. Brown, Q. Duval, S. Azadi, S. S. Rambhatla, A. Shah, X. Yin, D. Parikh, and I. Misra, “Emu video: Factorizing text-to-video generation by explicit image conditioning,” arXiv preprint arXiv:2311.10709, 2023.
  22. A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis, “Align your latents: High-resolution video synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22563–22575.
  23. J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,” Advances in Neural Information Processing Systems, vol. 35, pp. 8633–8646, 2022.
  24. J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet et al., “Imagen video: High definition video generation with diffusion models,” arXiv preprint arXiv:2210.02303, 2022.
  25. D. Zhou, W. Wang, H. Yan, W. Lv, Y. Zhu, and J. Feng, “Magicvideo: Efficient video generation with latent diffusion models,” arXiv preprint arXiv:2211.11018, 2022.
  26. J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang, “Modelscope text-to-video technical report,” arXiv preprint arXiv:2308.06571, 2023.
  27. W. Chen, J. Wu, P. Xie, H. Wu, J. Li, X. Xia, X. Xiao, and L. Lin, “Control-a-video: Controllable text-to-video generation with diffusion models,” arXiv preprint arXiv:2305.13840, 2023.
  28. P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7346–7356.
  29. Y. Guo, C. Yang, A. Rao, M. Agrawala, D. Lin, and B. Dai, “Sparsectrl: Adding sparse controls to text-to-video diffusion models,” arXiv preprint arXiv:2311.16933, 2023.
  30. Y. Wei, S. Zhang, Z. Qing, H. Yuan, Z. Liu, Y. Liu, Y. Zhang, J. Zhou, and H. Shan, “Dreamvideo: Composing your dream videos with customized subject and motion,” arXiv preprint arXiv:2312.04433, 2023.
  31. H. Jeong, G. Y. Park, and J. C. Ye, “Vmc: Video motion customization using temporal attention adaption for text-to-video diffusion models,” arXiv preprint arXiv:2312.00845, 2023.
  32. Y. Jain, A. Nasery, V. Vineet, and H. Behl, “Peekaboo: Interactive video generation via masked-diffusion,” arXiv preprint arXiv:2312.07509, 2023.
  33. Y. Teng, E. Xie, Y. Wu, H. Han, Z. Li, and X. Liu, “Drag-a-video: Non-rigid video editing with point-based interaction,” arXiv preprint arXiv:2312.02936, 2023.
  34. R. Wu, L. Chen, T. Yang, C. Guo, C. Li, and X. Zhang, “Lamp: Learn a motion pattern for few-shot-based video generation,” arXiv preprint arXiv:2310.10769, 2023.
  35. R. Zhao, Y. Gu, J. Z. Wu, D. J. Zhang, J. Liu, W. Wu, J. Keppo, and M. Z. Shou, “Motiondirector: Motion customization of text-to-video diffusion models,” arXiv preprint arXiv:2310.08465, 2023.
  36. X. Wang, H. Yuan, S. Zhang, D. Chen, J. Wang, Y. Zhang, Y. Shen, D. Zhao, and J. Zhou, “Videocomposer: Compositional video synthesis with motion controllability,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  37. Y. Deng, R. Wang, Y. Zhang, Y.-W. Tai, and C.-K. Tang, “Dragvideo: Interactive drag-style video editing,” arXiv preprint arXiv:2312.02216, 2023.
  38. A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
  39. S. Zhou, X. Jiang, W. Tan, R. He, and B. Yan, “Mvflow: Deep optical flow estimation of compressed videos with motion vector prior,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 1964–1974.
  40. D. Fleet and Y. Weiss, “Optical flow estimation,” in Handbook of mathematical models in computer vision. Springer, 2006, pp. 237–257.
  41. H. Zhang, D. Liu, Q. Zheng, and B. Su, “Modeling video as stochastic processes for fine-grained video representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2225–2234.
  42. Z. Gharibi and S. Faramarzi, “Multi-frame spatio-temporal super-resolution,” Signal, Image and Video Processing, vol. 17, no. 8, pp. 4415–4424, 2023.
  43. D. Young, “Iterative methods for solving partial difference equations of elliptic type,” Transactions of the American Mathematical Society, vol. 76, no. 1, pp. 92–111, 1954.
  44. L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
  45. M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, vol. 96, no. 34, 1996, pp. 226–231.
  46. J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
  47. T. Unterthiner, S. Van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “Towards accurate generative models of video: A new metric & challenges,” arXiv preprint arXiv:1812.01717, 2018.
  48. Y. Balaji, M. R. Min, B. Bai, R. Chellappa, and H. P. Graf, “Conditional gan with discriminative filter generation for text-to-video synthesis.” in IJCAI, vol. 1, no. 2019, 2019, p. 2.
  49. G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” in Image Analysis: 13th Scandinavian Conference, SCIA 2003, Halmstad, Sweden, June 29–July 2, 2003, Proceedings. Springer, 2003, pp. 363–370.
Authors (8)
  1. Teng Hu
  2. Jiangning Zhang
  3. Ran Yi
  4. Yating Wang
  5. Hongrui Huang
  6. Jieyu Weng
  7. Yabiao Wang
  8. Lizhuang Ma