FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait (2412.01064v3)

Published 2 Dec 2024 in cs.CV, cs.AI, cs.LG, cs.MM, and eess.IV

Abstract: With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on flow matching generative model. Instead of a pixel-based latent space, we take advantage of a learned orthogonal motion latent space, enabling efficient generation and editing of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with an effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.

Summary

  • The paper introduces FLOAT, a novel generative model that leverages flow matching in a motion latent space to produce high-resolution, temporally consistent talking portraits.
  • The methodology employs a motion auto-encoder and a Flow Matching Transformer to decouple facial motion from identity, enabling efficient and expressive lip sync and head movements.
  • Experimental results demonstrate superior visual quality and motion fidelity, with key metrics like FID, FVD, and LSE-D supporting its potential for real-time avatar and VR applications.

An In-depth Analysis of "FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait"

The paper "FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait" introduces a novel approach to the challenging task of generating audio-driven talking portraits from a single image. The proposed method, termed FLOAT, leverages a flow matching generative model to enhance the temporal consistency of generated videos while reducing the typical sampling inefficiencies associated with diffusion-based methods.

FLOAT stands out by shifting from conventional pixel-based generative modeling to a learned motion latent space. This transition is facilitated by a motion auto-encoder that decomposes motion and identity in the latent space, capturing a broad spectrum of facial and head movements efficiently. The auto-encoder builds upon Latent Image Animator (LIA) principles, scaled up to produce high-resolution outputs with improved visual fidelity and motion expressiveness in the generated videos.
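To make the identity/motion decomposition concrete, the sketch below illustrates the general idea in a minimal, hypothetical form (it is not the authors' architecture): an encoder splits a portrait frame into an appearance feature and a compact motion latent, and a decoder renders a frame from any (identity, motion) pair. All module names, layer choices, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MotionAutoEncoder(nn.Module):
    """Hypothetical sketch of an identity/motion-decoupled auto-encoder.

    The encoder splits a portrait frame into an identity (appearance)
    feature and a compact motion latent; the decoder renders a frame
    from any (identity, motion) pair. Shapes and layers are assumptions,
    not the FLOAT architecture.
    """

    def __init__(self, motion_dim: int = 512):
        super().__init__()
        self.appearance_enc = nn.Sequential(  # image -> identity feature map
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.motion_enc = nn.Sequential(      # image -> low-dimensional motion latent
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(3 * 8 * 8, motion_dim),
        )
        self.decoder = nn.Sequential(         # identity + motion -> rendered frame
            nn.ConvTranspose2d(128 + motion_dim, 64, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, src_img: torch.Tensor, drv_img: torch.Tensor) -> torch.Tensor:
        identity = self.appearance_enc(src_img)   # appearance from the source portrait
        motion = self.motion_enc(drv_img)          # motion latent from the driving frame
        motion_map = motion[:, :, None, None].expand(
            -1, -1, identity.shape[2], identity.shape[3])
        return self.decoder(torch.cat([identity, motion_map], dim=1))
```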

A critical innovation presented in this paper is the use of a transformer-based vector field predictor—dubbed the Flow Matching Transformer (FMT). The FMT decouples frame-wise conditioning from the attention mechanism and excels in generating smooth, temporally coherent motion latents. This design aligns with the flow matching technique, a notable alternative to diffusion models, offering high-speed sampling and quality that rivals existing state-of-the-art diffusion models.
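Flow matching trains a network to regress a velocity field along simple paths between noise and data, and then integrates that field at inference time. The snippet below is a minimal sketch of the conditional flow-matching objective applied to a sequence of motion latents, with a small transformer standing in for the vector-field predictor; the conditioning scheme, layer sizes, and hyperparameters are assumptions, not the paper's FMT.

```python
import torch
import torch.nn as nn

class VectorFieldTransformer(nn.Module):
    """Toy stand-in for a transformer vector-field predictor over motion latents."""

    def __init__(self, motion_dim: int = 512, cond_dim: int = 768, n_layers: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim + cond_dim + 1, motion_dim)  # +1 for flow time t
        layer = nn.TransformerEncoderLayer(d_model=motion_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(motion_dim, motion_dim)

    def forward(self, x_t, t, audio_feats):
        # x_t: (B, T, motion_dim) noisy motion latents at flow time t
        # t: (B,) scalar flow time in [0, 1]
        # audio_feats: (B, T, cond_dim) frame-wise audio conditioning
        t_tok = t[:, None, None].expand(-1, x_t.shape[1], 1)
        h = self.in_proj(torch.cat([x_t, audio_feats, t_tok], dim=-1))
        return self.out_proj(self.backbone(h))


def flow_matching_loss(model, x1, audio_feats):
    """Conditional flow-matching loss on straight noise->data paths.

    x1: (B, T, D) ground-truth motion latents; along the linear path
    x_t = (1 - t) * x0 + t * x1, the target velocity is simply x1 - x0.
    """
    x0 = torch.randn_like(x1)                       # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)   # random flow time per sample
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    v_target = x1 - x0
    v_pred = model(x_t, t, audio_feats)
    return ((v_pred - v_target) ** 2).mean()
```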

FLOAT also incorporates speech-driven emotion enhancement, enabling the natural inclusion of expressive non-verbal motions. By conditioning on emotion labels predicted from the driving speech, the model not only synchronizes lip movements but also aligns head poses and facial expressions with the speaker's emotional tone, enhancing the realism of the talking portraits.
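As a purely illustrative sketch of how such conditioning could be wired in (the classifier, label set, and injection point are all assumptions, not the paper's design), one could derive soft emotion probabilities from a speech-emotion head and append them to the frame-wise audio conditioning:

```python
import torch
import torch.nn as nn

# Assumed label set and a dummy linear head standing in for any
# off-the-shelf speech-emotion classifier over clip-level audio features.
EMOTIONS = ["angry", "happy", "neutral", "sad", "surprised"]

class DummySpeechEmotionHead(nn.Module):
    def __init__(self, audio_dim: int = 768, n_emotions: int = len(EMOTIONS)):
        super().__init__()
        self.head = nn.Linear(audio_dim, n_emotions)

    def forward(self, audio_feats):                # (B, T, audio_dim)
        return self.head(audio_feats.mean(dim=1))  # (B, n_emotions) clip-level logits


def build_conditioning(audio_feats, emotion_head):
    """Append soft emotion probabilities to frame-wise audio features."""
    probs = emotion_head(audio_feats).softmax(dim=-1)           # (B, E)
    probs = probs[:, None, :].expand(-1, audio_feats.shape[1], -1)
    return torch.cat([audio_feats, probs], dim=-1)              # (B, T, audio_dim + E)
```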

Experimental evaluations show that FLOAT achieves superior performance across a suite of benchmarks when compared to existing audio-driven talking portrait generation methods. Key quantitative metrics, including FID, FVD, and LSE-D, indicate that the approach reaches new levels of visual quality, motion fidelity, and efficiency. Additionally, FLOAT can sample with as few as 10 function evaluations during ODE integration, a substantial step toward real-time applications.
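The efficiency claim follows from integrating the learned velocity field with only a handful of steps. A minimal fixed-step Euler sampler over the motion latents might look like the following; the solver choice and model interface mirror the sketches above and are assumptions rather than the paper's implementation.

```python
import torch

@torch.no_grad()
def sample_motion_latents(model, audio_feats, motion_dim=512, n_steps=10):
    """Integrate the learned vector field from noise to motion latents.

    Uses simple Euler steps; n_steps = 10 corresponds to the low number of
    function evaluations reported in the paper (the solver is an assumption).
    """
    B, T, _ = audio_feats.shape
    x = torch.randn(B, T, motion_dim, device=audio_feats.device)  # noise at flow time t = 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((B,), i * dt, device=audio_feats.device)
        v = model(x, t, audio_feats)   # predicted velocity at flow time t
        x = x + dt * v                 # Euler step toward the data endpoint (t = 1)
    return x                           # motion latents to be decoded into video frames
```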

Looking forward, the implications of FLOAT extend beyond the immediate generation of talking portraits. Given its foundational basis in motion latent spaces and flow matching models, the framework holds promise for broader applications in avatar creation, virtual reality, and animation, where temporal coherence and expressiveness are paramount.

In conclusion, FLOAT presents a compelling advancement in audio-driven portrait video generation, addressing key limitations of existing methods through innovative uses of motion latent spaces and flow matching. This work opens avenues for further refinement and application of generative models in nuanced scenarios demanding high-quality temporal dynamics and emotional expressiveness. Future exploration could focus on expanding the granularity of emotional labels and incorporating additional modalities to enhance the versatility and application scope of the model.
