- The paper introduces a novel framework that disentangles pose guidance into sparse motion field and keypoint correspondence modules to enhance animation control.
- It integrates a hybrid ControlNet that injects both motion field and point correspondence signals into existing denoising models with minimal modifications.
- Experiments show clear gains on FID-FVD and VBench metrics, along with robust cross-identity animation quality and favorable computational efficiency.
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
This paper presents DisPose, an approach to controllable human image animation that improves animation quality and consistency without relying on dense motion representations. DisPose animates a static character image from a driving video, using only sparse skeleton poses to achieve better motion alignment even when the reference and driving characters differ in body shape or proportions.
Summary
Controllable human image animation has attracted growing interest for its applications in creative domains and digital humans. Existing techniques typically rely on either sparse or dense control signals, but struggle to balance effectiveness with adaptability across varying body shapes and movements. DisPose disentangles pose guidance into two components: motion field guidance and keypoint correspondence. By taking only skeleton poses as input, it avoids the pitfalls of dense conditions, such as rigid geometric constraints and higher model complexity.
Methodology
The DisPose approach is notable for its integration into existing animation frameworks without requiring significant alterations, essentially operating as a plug-and-play module. The methodology can be summarized through these key components:
- Motion Field Guidance:
  - Sparse Motion Field: DWPose estimates keypoints, whose trajectories are tracked across frames to form a sparse motion field. Gaussian filtering then spreads each trajectory's influence over a local neighborhood, yielding a smoother, more robust motion signal.
  - Dense Motion Field: Conditional Motion Propagation (CMP) propagates the sparse guidance into a dense motion field conditioned on the reference image, avoiding strict geometric constraints at inference time.
- Keypoint Correspondence: Diffusion features from a pre-trained image diffusion model are extracted at the keypoint locations, helping preserve identity and appearance across the animation. These sparse features are transferred to the corresponding target-pose locations through the hybrid ControlNet architecture.
- Hybrid ControlNet Integration: This component injects both the motion field and the keypoint correspondence signals into the denoising network of existing image animation models as control signals.
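To make the motion field guidance concrete, the following is a minimal sketch of rasterizing sparse keypoint displacements into a dense grid with Gaussian weighting. This is an illustrative approximation of the "sparse motion field plus Gaussian filtering" idea described above, not the paper's actual implementation; the function name, shapes, and `sigma` parameter are all assumptions.

```python
import numpy as np

def sparse_motion_field(kpts_ref, kpts_drv, h, w, sigma=8.0):
    """Rasterize keypoint displacements into an (h, w, 2) motion field.

    Each keypoint's displacement (driving minus reference position) is
    spread over nearby pixels with a Gaussian weight, mimicking the
    Gaussian-filtered sparse motion field described above. Purely
    illustrative; not the DisPose implementation.
    """
    field = np.zeros((h, w, 2), dtype=np.float32)
    weight = np.zeros((h, w, 1), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]  # pixel coordinate grids
    for (x0, y0), (x1, y1) in zip(kpts_ref, kpts_drv):
        # Gaussian weight centered on the reference keypoint.
        g = np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))
        disp = np.array([x1 - x0, y1 - y0], dtype=np.float32)
        field += g[..., None] * disp
        weight += g[..., None]
    # Normalize by accumulated weight; far from all keypoints the
    # field smoothly decays toward zero.
    return field / np.clip(weight, 1e-6, None)

# Example: one keypoint moving 5 px right and 3 px down.
f = sparse_motion_field([(16, 16)], [(21, 19)], h=32, w=32)
```

In a real pipeline, DWPose-style keypoints from each driving frame would feed a field like this, which CMP would then refine into dense motion conditioned on the reference image.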
Results and Implications
Extensive qualitative and quantitative comparisons, particularly on challenging benchmarks such as TikTok, underscore the efficacy of DisPose over existing methods. The model shows marked improvements on metrics such as FID-FVD and VBench, indicating better video realism and closer alignment with human perception. Its cross-identity animation capability further highlights its robustness, producing consistent, high-quality animations across varied identity inputs.
Future Work
While DisPose presents significant advancements in controllable human animation, areas for further enhancement include addressing the synthesis of unseen parts in characters and exploring multi-view synthesis capabilities. The inclusion of 3D sparse poses as control conditions and further integration with camera control models may provide avenues to bypass current limitations in scene synthesis and viewpoint variability.
Conclusion
This paper lays a foundational framework for achieving high-quality, controllable animations by using sparse pose information effectively. DisPose balances consistent visual output with computational efficiency, opening new possibilities for practical AI-driven animation systems. Its modular, plug-and-play design allows seamless integration into existing pipelines without requiring dense input representations.
In adopting DisPose, researchers and practitioners can potentially enhance the fidelity and consistency of character animations, thereby broadening the scope and applicability of such techniques in real-world scenarios.