- The paper introduces FantasyTalking, a method using a dual-stage audio-visual alignment strategy and diffusion transformers to generate realistic talking portraits.
- FantasyTalking outperforms existing methods on key metrics (FVD, FID, Sync-C/D, IDC, and SD), producing videos with accurate lip synchronization and diverse motion.
- The proposed dual-stage method has significant implications for VR, gaming, and other areas requiring realistic avatar synthesis and could redefine video generation benchmarks.
An Expert Review of "FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis"
The paper "FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis," authored by a team from Alibaba Group and Beijing University of Posts and Telecommunications, introduces an innovative approach to generating realistic talking head animations from static portrait images. This task is particularly challenging due to the need for accurately capturing facial expressions, synchronizing lip movements with audio, and integrating global and background dynamics seamlessly into the video output.
Methodology Overview
The core innovation of the paper is its dual-stage audio-visual alignment strategy. This framework uses a pretrained video diffusion transformer model to enable the synthesis of high-fidelity, coherent talking portraits. The dual-stage approach is crucial for tackling the complexity of audio-visual synchronization and dynamic avatar generation:
- Clip-Level Audio-Visual Alignment: This initial stage focuses on establishing coherent global motion across the entire scene by integrating audio-driven dynamics aligned with reference portraits, background, and contextual elements. The model leverages the spatio-temporal modeling strengths of a diffusion transformer to capture audio-visual correlations over extended sequences, contributing to a unified motion depiction.
- Frame-Level Refinement: The second stage zooms in on precise frame-level synchronization, refining lip movements with a carefully constructed lip-tracing mask. This ensures that facial dynamics, especially lip synchronization with audio, reach high precision despite the lips occupying only a small region of the face (see the sketch after this list).
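To make the two stages concrete, here is a minimal PyTorch-style sketch of how such a dual-stage training objective could look. The names (`denoiser`, `audio_feats`, `lip_mask`) and the mask re-weighting are illustrative assumptions, not the authors' actual implementation; both stages are assumed to use a standard diffusion denoising loss conditioned on audio.

```python
# Hedged sketch of the two training stages under assumed shapes and names.
# Stage 1 applies the denoising loss over the whole clip; Stage 2 re-weights
# the same loss with a lip-region mask so the small mouth area dominates
# the gradient signal.
import torch
import torch.nn.functional as F

def clip_level_loss(denoiser, noisy_latents, timesteps, audio_feats, target_noise):
    """Stage 1: align audio with global motion across the entire clip."""
    pred = denoiser(noisy_latents, timesteps, audio_context=audio_feats)
    return F.mse_loss(pred, target_noise)

def frame_level_loss(denoiser, noisy_latents, timesteps, audio_feats,
                     target_noise, lip_mask, lip_weight=5.0):
    """Stage 2: emphasize the lip region for precise frame-level lip sync."""
    pred = denoiser(noisy_latents, timesteps, audio_context=audio_feats)
    per_pixel = (pred - target_noise) ** 2        # (B, C, T, H, W)
    weights = 1.0 + lip_weight * lip_mask         # boost lip-mask pixels
    return (per_pixel * weights).mean()
```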
To preserve identity without hindering motion flexibility, the authors replace traditional reference networks with a more computationally efficient facial-focused cross-attention module that explicitly attends to the facial region to keep identity consistent across frames. In addition, a motion intensity modulation module gives explicit control over the strength of facial expressions and body movements, enabling controllable portrait animation.
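The sketch below illustrates, under my own assumptions, how these two add-on modules might be structured: a cross-attention block in which video tokens attend to face-region tokens from the reference portrait, and a small MLP that maps a motion-intensity scalar to a conditioning embedding. This is not the authors' code; module names and dimensions are placeholders.

```python
# Illustrative sketch of a facial-focused cross-attention block and a
# motion-intensity conditioning module (assumed design, not the paper's API).
import torch
import torch.nn as nn

class FaceCrossAttention(nn.Module):
    """Video tokens attend to face-region tokens from the reference portrait,
    preserving identity without a full reference network."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, face_tokens):
        out, _ = self.attn(query=self.norm(video_tokens),
                           key=face_tokens, value=face_tokens)
        return video_tokens + out  # residual connection keeps motion flexibility

class MotionIntensityEmbed(nn.Module):
    """Maps a user-chosen motion-intensity scalar in [0, 1] to an embedding
    that conditions the denoiser, enabling explicit motion control."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, intensity: torch.Tensor):  # intensity: (B, 1)
        return self.mlp(intensity)
```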
Experimental Results and Metrics
The proposed method, FantasyTalking, demonstrates superior performance across a range of metrics:
- FVD and FID: Fréchet Video Distance and Fréchet Inception Distance, standard fidelity metrics where lower scores indicate more coherent, higher-quality visual content.
- Sync-C and Sync-D: Synchronization confidence (higher is better) and synchronization distance (lower is better), which measure lip-sync accuracy; the reported gains stem mainly from the dual-stage alignment approach.
- Identity Consistency (IDC) and Subject Dynamics (SD): IDC measures how well identity features are preserved over time, while SD captures the diversity of motion; both highlight the effectiveness of the facial-focused cross-attention and motion intensity modules (see the sketch after this list).
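As an example of how an identity-consistency score of this kind could be computed, the sketch below averages the cosine similarity between a face embedding of the reference portrait and embeddings of each generated frame. The `embed`-style inputs stand in for any face-recognition encoder output; the paper's exact IDC definition may differ.

```python
# Hedged sketch of an identity-consistency metric: mean cosine similarity
# between the reference-face embedding and per-frame face embeddings.
import torch
import torch.nn.functional as F

def identity_consistency(ref_embedding: torch.Tensor,
                         frame_embeddings: torch.Tensor) -> float:
    """ref_embedding: (D,), frame_embeddings: (T, D) from a face encoder."""
    sims = F.cosine_similarity(frame_embeddings,
                               ref_embedding.unsqueeze(0), dim=-1)  # (T,)
    return sims.mean().item()
```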
FantasyTalking outperforms existing works not only in faithful lip synchronization but also in producing videos with realistic, diverse motion dynamics, including head and shoulder movements that reflect nuanced human behavior.
Implications and Future Directions
The implications of this research are vast for fields like virtual reality, gaming, and any domain requiring realistic avatar synthesis. The dual-stage training and facial-focused identity preservation approach could redefine video generation benchmarks by ensuring that the synthesized animations are not only visually convincing but also contextually consistent with user-driven inputs.
Future work could extend this capability toward real-time applications and adapt the methodology to more diverse and dynamic inputs, such as multiple avatars within a single scene or cross-lingual lip synchronization.
This work stands out for its rigorous alignment techniques and the introduction of novel modules targeting identity consistency and motion diversity, making a stride toward achieving realistic and engaging virtual representations.