ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model (2503.21144v1)

Published 27 Mar 2025 in cs.CV

Abstract: Real-time interactive video-chat portraits have been increasingly recognized as the future trend, particularly due to the remarkable progress made in text and voice chat technologies. However, existing methods primarily focus on real-time generation of head movements, but struggle to produce synchronized body motions that match these head actions. Additionally, achieving fine-grained control over the speaking style and nuances of facial expressions remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking head to upper-body interaction. Our approach consists of two stages. The first stage involves efficient hierarchical motion diffusion models that take both explicit and implicit motion representations into account based on audio inputs, and can generate a diverse range of facial expressions with stylistic control and synchronization between head and body movements. The second stage aims to generate portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further perform face refinement to enhance the overall realism and expressiveness of the portrait video. Additionally, our approach supports efficient and continuous generation of upper-body portrait video at up to 512 × 768 resolution and 30 fps on a 4090 GPU, supporting real-time interactive video chat. Experimental results demonstrate the capability of our approach to produce portrait videos with rich expressiveness and natural upper-body movements.

Summary

Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model

The paper "ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model" presents a novel framework for generating real-time, stylized portrait video animations. This work is particularly significant due to the increasing demand for realistic digital human interactions driven by advancements in LLMs and diffusion models.

Framework Overview and Methodology

The authors address limitations of existing portrait video generation methods, which primarily focus on head movements and lack both synchronized upper-body motion and fine-grained control over facial expressions. The paper introduces a two-stage approach to overcome these challenges:

  1. Hierarchical Motion Diffusion Model: The first stage leverages a diffusion model that integrates both explicit and implicit motion representations conditioned on audio inputs. This model produces coordinated facial expressions and head-body movements with stylistic control and synchronization; a minimal sketch of this stage appears after the list.
  2. Portrait Video Generation with Hand Control Signals: The second stage synthesizes video featuring the upper body and hand gestures by injecting explicit hand control signals into the generator, and a face refinement module enhances the realism and expressiveness of the output (see the second sketch after the list). The system generates portrait videos at resolutions up to 512 × 768 at 30 fps on an NVIDIA RTX 4090 GPU.
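To make the first stage concrete, below is a minimal, hypothetical PyTorch sketch of an audio-conditioned hierarchical denoiser: a coarse branch predicts noise on explicit head/body pose parameters, and a fine branch predicts noise on implicit expression latents while conditioning on the coarse prediction, which is one plausible way to realize the head-body synchronization described above. All module names and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HierarchicalMotionDenoiser(nn.Module):
    """Illustrative two-level denoiser: coarse pose -> fine expression."""

    def __init__(self, audio_dim=768, pose_dim=6, expr_dim=64, hidden=256):
        super().__init__()
        # Shared embedding of the diffusion timestep.
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        # Coarse branch: explicit head/body pose, conditioned on audio.
        self.pose_net = nn.Sequential(
            nn.Linear(pose_dim + audio_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim))
        # Fine branch: implicit expression latents, conditioned on audio
        # AND the coarse prediction, which couples head and body motion.
        self.expr_net = nn.Sequential(
            nn.Linear(expr_dim + audio_dim + pose_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, expr_dim))

    def forward(self, noisy_pose, noisy_expr, audio_feat, t):
        temb = self.time_embed(t[:, None].float())
        pose_eps = self.pose_net(
            torch.cat([noisy_pose, audio_feat, temb], dim=-1))
        expr_eps = self.expr_net(
            torch.cat([noisy_expr, audio_feat, pose_eps, temb], dim=-1))
        return pose_eps, expr_eps

# Smoke test: batch of 4 frames, 1000 diffusion steps.
model = HierarchicalMotionDenoiser()
pose_eps, expr_eps = model(
    torch.randn(4, 6), torch.randn(4, 64),
    torch.randn(4, 768), torch.randint(0, 1000, (4,)))
```

Feeding the coarse branch's output into the fine branch is a common way to build hierarchy into a denoiser; per-frame stylistic control could be added as one more conditioning vector in the same concatenations.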
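For the second stage, a similarly hedged sketch of the conditioning scheme: rasterized hand-keypoint heatmaps are injected as extra input channels to the renderer, and a small residual network refines the face crop. The tiny convolutional stack stands in for the paper's full video generator; all shapes and names here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PortraitRenderer(nn.Module):
    """Illustrative renderer: hand-control injection + face refinement."""

    def __init__(self, img_ch=3, hand_ch=21, feat=64):
        super().__init__()
        # The backbone sees the reference frame concatenated with one
        # heatmap channel per hand keypoint, so gestures steer synthesis.
        self.backbone = nn.Sequential(
            nn.Conv2d(img_ch + hand_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, img_ch, 3, padding=1))
        # Face refinement: a residual correction applied to the face crop only.
        self.face_refiner = nn.Sequential(
            nn.Conv2d(img_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, img_ch, 3, padding=1))

    def forward(self, ref_frame, hand_heatmaps, face_box):
        frame = self.backbone(torch.cat([ref_frame, hand_heatmaps], dim=1))
        x1, y1, x2, y2 = face_box
        face = frame[:, :, y1:y2, x1:x2]
        refined = frame.clone()
        refined[:, :, y1:y2, x1:x2] = face + self.face_refiner(face)
        return refined

# One 512x768 frame, 21 hand-keypoint heatmaps, and a face bounding box.
out = PortraitRenderer()(
    torch.randn(1, 3, 768, 512), torch.randn(1, 21, 768, 512),
    (128, 64, 384, 320))
```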

Experimental Evaluation

The paper reports strong quantitative results demonstrating the framework's ability to produce rich expressiveness and natural movements. The proposed method shows clear improvements in metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Fréchet Inception Distance (FID) over existing methods, while maintaining stable output quality across inference settings and resolutions, which points to efficient use of computational resources.
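As a hedged illustration of how such full-reference metrics are typically computed (standard practice, not the paper's evaluation script), per-frame PSNR and SSIM can be obtained with scikit-image (≥ 0.19 for `channel_axis`); FID, by contrast, compares Inception-feature statistics between the sets of generated and real frames rather than aligned pairs.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(generated: np.ndarray, reference: np.ndarray):
    """Per-frame fidelity for aligned uint8 HxWx3 frames; higher is better."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim

# Example with random frames; real use would loop over a test video.
gen = np.random.randint(0, 256, (768, 512, 3), dtype=np.uint8)
ref = np.random.randint(0, 256, (768, 512, 3), dtype=np.uint8)
print(frame_metrics(gen, ref))
```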

Implications and Future Directions

Practically, this research has broad implications for interactive applications such as virtual avatars, video conferencing, and augmented reality, enabling more lifelike interactions with digital agents. Theoretically, the hierarchical motion diffusion formulation deepens the understanding and application of stylistic control in real-time video synthesis.

Looking ahead, there is potential to refine and scale this approach to full-body animation, including limb and gesture dynamics, within a single cohesive framework, in step with increasingly sophisticated modeling of human emotions and expressions in virtual environments.

Conclusion

This paper contributes significant advances to the domain of stylized real-time portrait video generation. It offers a comprehensive solution leveraging hierarchical motion diffusion models for synchronized and expressive video synthesis, paving the way for future developments in AI-driven digital human interactions. Researchers can build upon this work to explore extended capabilities in real-time video generation, utilizing more complex datasets and experimental environments to enhance realism and engagement in interactive systems.
