DreaMoving: A Human Video Generation Framework based on Diffusion Models (2312.05107v2)

Published 8 Dec 2023 in cs.CV

Abstract: In this paper, we present DreaMoving, a diffusion-based controllable video generation framework to produce high-quality customized human videos. Specifically, given target identity and posture sequences, DreaMoving can generate a video of the target identity moving or dancing anywhere driven by the posture sequences. To this end, we propose a Video ControlNet for motion-controlling and a Content Guider for identity preserving. The proposed model is easy to use and can be adapted to most stylized diffusion models to generate diverse results. The project page is available at https://dreamoving.github.io/dreamoving

References (13)
  1. Frozen in Time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021.
  2. ZoeDepth: Zero-shot transfer by combining relative and metric depth, 2023.
  3. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
  4. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
  5. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  6. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  7. OpenCLIP, 2021.
  8. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  9. High-resolution image synthesis with latent diffusion models, 2021.
  10. DreamBooth: Fine-tuning text-to-image diffusion models for subject-driven generation, 2022.
  11. Effective whole-body pose estimation with two-stages distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4220, 2023.
  12. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models, 2023.
  13. Adding conditional control to text-to-image diffusion models, 2023.
Authors (16)
  1. Mengyang Feng (12 papers)
  2. Jinlin Liu (10 papers)
  3. Kai Yu (202 papers)
  4. Yuan Yao (292 papers)
  5. Zheng Hui (27 papers)
  6. Xiefan Guo (8 papers)
  7. Xianhui Lin (11 papers)
  8. Haolan Xue (2 papers)
  9. Chen Shi (55 papers)
  10. Xiaowen Li (14 papers)
  11. Aojie Li (3 papers)
  12. Miaomiao Cui (27 papers)
  13. Peiran Ren (28 papers)
  14. Xuansong Xie (69 papers)
  15. Xiaoyang Kang (7 papers)
  16. Biwen Lei (12 papers)
Citations (20)

Summary

An Overview of DreaMoving: A Diffusion-Based Framework for Human Video Generation

The paper presents "DreaMoving," a video generation framework grounded in diffusion models and aimed at producing high-quality customized human videos. The system focuses on maintaining consistent motion and preserving identity while remaining adaptable to different visual styles. The framework hinges on two core components: the Video ControlNet for motion control and the Content Guider for identity preservation.

DreaMoving advances human-centered video generation, which has previously struggled with generating human dance videos. A key obstacle has been the scarcity of open-source datasets and of the precise text descriptions needed to train effective Text-to-Video (T2V) models. Prior methods such as ControlNet have shown promise for structural control but introduce additional computational complexity when precise motion patterns must be captured.

Framework Architecture

The architecture is built on Stable Diffusion and consists of several integrated components (a minimal code sketch follows the list):

  1. Denoising U-Net: This module serves as the backbone for video generation, incorporating motion blocks inspired by AnimateDiff to ensure temporal and motion fidelity.
  2. Video ControlNet: This component extends beyond traditional image ControlNet by including motion-controlling capabilities that process control sequences such as pose or depth, contributing to maintaining temporal consistency.
  3. Content Guider: This module enhances content control by integrating both image and text prompts, facilitating detailed human appearance preservation. By leveraging an image encoder and IP-Adapter techniques, the Content Guider translates reference images into detailed content embeddings.
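The summary describes these modules only at a high level. The PyTorch sketch below illustrates how the pieces could fit together in a single denoising step; all class, function, and argument names (ContentGuider, denoise_step, pose_seq, the token and embedding dimensions) are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class ContentGuider(nn.Module):
    """Fuses text-prompt embeddings with image-derived content tokens,
    loosely following an IP-Adapter-style projection of reference-image features."""
    def __init__(self, img_dim=1024, ctx_dim=768, n_tokens=4):
        super().__init__()
        # Project pooled reference-image features into a few extra context tokens.
        self.proj = nn.Linear(img_dim, ctx_dim * n_tokens)
        self.n_tokens, self.ctx_dim = n_tokens, ctx_dim

    def forward(self, text_emb, image_feat):
        # text_emb: (B, L, ctx_dim); image_feat: (B, img_dim)
        img_tokens = self.proj(image_feat).view(-1, self.n_tokens, self.ctx_dim)
        return torch.cat([text_emb, img_tokens], dim=1)

def denoise_step(unet, video_controlnet, guider, noisy_latents, t,
                 text_emb, ref_image_feat, pose_seq):
    """One denoising step: the pose (or depth) sequence conditions the U-Net
    through ControlNet residuals, while the Content Guider supplies identity context."""
    context = guider(text_emb, ref_image_feat)                  # (B, L + n_tokens, ctx_dim)
    residuals = video_controlnet(noisy_latents, t, context, pose_seq)
    return unet(noisy_latents, t, context, residuals)
```

In this reading, the Content Guider only enriches the conditioning context, while the Video ControlNet injects per-frame structural guidance; the motion blocks inside the U-Net are responsible for temporal consistency across frames.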

Model Training and Evaluation

Each component is trained separately, with an emphasis on high-quality human dance video data. Training includes long-frame pretraining to accommodate extended motion sequences, followed by refinement of the Video ControlNet for improved expression and motion. Fine-tuning is carried out at resolutions up to 512x512 pixels.
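As a rough illustration of this staged recipe, the configuration below encodes the two stages mentioned above. The stage names, frame counts, and frozen-module lists are assumptions for the sake of the sketch; only the 512x512 training resolution comes from the text.

```python
# Illustrative staged-training configuration; stage names, frame counts, and
# frozen-module lists are assumptions. Only the 512x512 resolution is stated above.
training_stages = [
    {
        "name": "long_frame_pretraining",          # extended motion sequences
        "trainable": ["motion_blocks"],
        "frozen": ["unet_base", "text_encoder", "vae"],
        "num_frames": 64,                          # assumed long-clip length
        "resolution": (512, 512),
    },
    {
        "name": "video_controlnet_refinement",     # improved expression and motion
        "trainable": ["video_controlnet"],
        "frozen": ["unet_base", "motion_blocks", "text_encoder", "vae"],
        "num_frames": 16,                          # assumed clip length
        "resolution": (512, 512),
    },
]
```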

During fine-tuning, only selected components, such as the motion blocks and cross-attention layers, are optimized. This targeted optimization enables finer control over intricate aspects of video generation, such as the face and clothing semantic guidance integrated into the Content Guider.
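A common way to realize this kind of selective optimization is to unfreeze only the relevant parameter groups by name, as in the sketch below. The substring patterns ("motion" for motion blocks, "attn2" for cross-attention) follow typical diffusion U-Net naming conventions and are assumptions rather than details taken from the paper.

```python
import torch

def select_trainable(unet: torch.nn.Module):
    """Freeze everything except motion blocks and cross-attention layers."""
    trainable = []
    for name, param in unet.named_parameters():
        param.requires_grad = ("motion" in name) or ("attn2" in name)
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Hypothetical usage: pass only the unfrozen parameters to the optimizer.
# optimizer = torch.optim.AdamW(select_trainable(unet), lr=1e-5, weight_decay=1e-2)
```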

Results and Implications

The evaluation illustrates DreaMoving's efficacy in generating high-quality human videos with diverse stylistic outputs while retaining control over identity and motion. The results highlight DreaMoving's versatility, with outputs guided by text descriptions, specific faces, attire, and stylized images.

Implications and Future Directions

From a theoretical perspective, DreaMoving advances video generation approaches by addressing both structural and content disparities found in previous methodologies. Practically, the framework sets a precedent for customizable video generation systems that blend personalization with technological rigor. Looking forward, further developments could explore the integration of multi-modality controls in video generation, potentially expanding applications in entertainment and media sectors while addressing ethical and privacy considerations in personalized content creation.

In summary, DreaMoving is a significant contribution to diffusion-based video generation, offering a robust framework that addresses previously uncharted challenges in human-centric content production. Further exploration of adaptive learning techniques and integration across digital ecosystems could extend its applicability and capabilities.
