- The paper introduces DisCo, a novel approach that disentangles control of subjects, backgrounds, and poses to generate realistic human dance sequences.
- It employs a human attribute pre-training strategy with large-scale image datasets to improve generalizability without heavy reliance on video data.
- Evaluations show superior performance on metrics like FID, SSIM, and FVD, paving the way for enhancements such as fine-grained and multi-person interaction controls.
DisCo: Disentangled Control for Realistic Human Dance Generation
The paper "DisCo: Disentangled Control for Realistic Human Dance Generation" addresses specific challenges in generative AI, focusing on text-to-image and text-to-video synthesis, particularly in the context of human-centric content such as dance. Despite existing advancements, synthesizing realistic human dance sequences remains difficult due to the wide variability in poses and subject-specific human details encountered in social media scenarios.
Current methods, designed primarily for human motion transfer, struggle to generalize to the diverse and intricate real-world dance scenarios found on social media platforms. The paper identifies two attributes critical for effective human dance synthesis: generalizability and compositionality.
Generalizability and Compositionality
In addressing these challenges, the authors introduce DisCo, which emphasizes two key attributes:
- Generalizability: The capability to handle non-generic human viewpoints, unseen human subjects, diverse backgrounds, and novel poses.
- Compositionality: The ability to seamlessly compose seen/unseen subjects, backgrounds, and poses from varying sources.
DisCo Model Architecture
DisCo presents a novel architecture that improves compositionality by disentangling the control mechanisms for human subjects, backgrounds, and poses, and improves generalizability through a human attribute pre-training approach.
- Model Architecture: Utilizes ControlNet branches for background and pose control, employing VAE encoders to obtain a rich semantic representation of the background and convolutional encoders to abstract the pose skeleton. The human subject is injected via CLIP image embeddings, preserving subject identity with high fidelity during synthesis.
- Human Attribute Pre-Training: A pre-training strategy leverages large-scale human image datasets to distinguish foreground subjects from backgrounds, boosting generalizability without dependency on extensive video datasets.
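The disentangled design above can be sketched in a few lines. The following is a minimal toy illustration, not the paper's implementation: the encoder functions, shapes, and the scalar "gate" standing in for cross-attention are all hypothetical stand-ins for the real VAE, convolutional pose encoder, CLIP encoder, and ControlNet branches. The point it demonstrates is that each condition flows through its own pathway, so subjects, backgrounds, and poses can be swapped independently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the three condition encoders (the real model
# uses a VAE for the background, a convolutional encoder for the pose
# skeleton, and CLIP image embeddings for the human subject).
def encode_background(bg_img):
    # VAE-like spatial latent, 8x downsampled: (H/8, W/8, 1)
    return bg_img[::8, ::8, :].mean(axis=-1, keepdims=True)

def encode_pose(pose_map):
    # Conv-like downsampled skeleton features: (H/8, W/8, 3)
    return pose_map[::8, ::8, :]

def encode_subject(subject_img):
    # CLIP-like global embedding: (3,)
    return subject_img.mean(axis=(0, 1))

def denoise_step(noisy_latent, bg_feat, pose_feat, subj_emb):
    """One toy denoising step: the spatial controls are added residually
    (as ControlNet branches would be), while the subject embedding
    modulates the latent globally (as cross-attention would)."""
    gate = subj_emb.mean()  # scalar stand-in for attention weighting
    return noisy_latent + bg_feat + pose_feat.mean(-1, keepdims=True) * gate

H = W = 64
bg, pose, subj = (rng.random((H, W, 3)) for _ in range(3))
latent = rng.random((H // 8, W // 8, 1))

out = denoise_step(latent, encode_background(bg),
                   encode_pose(pose), encode_subject(subj))
print(out.shape)  # (8, 8, 1)

# Compositionality: swapping only the background changes the output
# through its own control path, leaving the other conditions untouched.
out_new_bg = denoise_step(latent, encode_background(rng.random((H, W, 3))),
                          encode_pose(pose), encode_subject(subj))
```

Because each condition enters through a separate encoder and injection point, recombining a seen subject with an unseen background or a novel pose requires no retraining of the other pathways.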
Evaluation and Implications
The authors conducted comprehensive evaluations, showing DisCo's superior performance on metrics including FID, SSIM, and FVD, with notable improvements in diverse compositional scenarios. Moreover, DisCo's ability to generate temporally consistent results without explicit temporal modeling marks a significant advance over existing state-of-the-art methods.
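Of the metrics mentioned, SSIM is the simplest to reproduce. Below is a minimal sketch of the core SSIM formula (Wang et al.'s constants) computed over a whole image; evaluation suites actually report a windowed mean-SSIM, and FID/FVD additionally require a pretrained feature extractor, so this is only an illustration of what the number measures: perceptual similarity of luminance, contrast, and structure between a generated frame and its ground truth.

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """Single-window SSIM between two images with values in [0, data_range].

    C1, C2 are the standard stabilizing constants; identical images
    score exactly 1.0, dissimilar images score lower.
    """
    C1 = (0.01 * data_range) ** 2
    C2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den

rng = np.random.default_rng(0)
frame = rng.random((64, 64))
score_same = ssim_global(frame, frame)       # identical frames -> 1.0
score_diff = ssim_global(frame, rng.random((64, 64)))
print(score_same, score_diff)
```

Higher SSIM indicates closer frame-level fidelity; FVD complements it by scoring temporal realism across whole clips rather than individual frames.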
Future Directions
The paper hints at further exploration into integrating more complex elements such as hand keypoints for fine-grained control and adapting to complex scenarios like multi-person interactions. These future enhancements could potentially broaden the applicability of DisCo in realistic and interactive applications within social media and entertainment industries.
In conclusion, DisCo emerges as a robust framework for realistic and flexible human dance generation, balancing faithfulness and diversity in generation and thereby paving the way for innovative applications in AI-driven content creation.