- The paper introduces Sonic, a novel framework for audio-driven talking face generation that leverages global audio perception as the primary driver, moving away from auxiliary visual cues.
- Sonic features a Context-enhanced Audio Learning module for extracting multi-scale audio features, a Motion-decoupled Controller for disentangling and controlling head and expression movements, and a Time-aware Position Shift Fusion strategy for producing long, temporally coherent videos.
- Quantitative and qualitative experiments demonstrate Sonic's superior performance in video quality, lip synchronization, expression diversity, and temporal consistency compared to existing methods.
The paper "Sonic: Shifting Focus to Global Audio Perception in Portrait Animation" introduces a novel approach to audio-driven talking face generation. The core idea is to leverage global audio perception as the primary driver for facial animation, moving away from reliance on auxiliary visual cues or spatial knowledge that often compromise naturalness and temporal consistency. The authors argue that audio, as a unique and global signal, provides inherent priors for facial expressions and lip movements, and should be the central element in guiding animation.
The proposed framework, named Sonic, achieves this through three key components:
- Context-enhanced Audio Learning: This module extracts both short-range and long-range temporal audio knowledge from the input audio clip. It uses a pre-trained audio feature extractor (Whisper-Tiny) to obtain multi-scale audio features, which then guide both the spatial and temporal aspects of the generated video. Spatially, audio features are injected into spatial cross-attention layers and focused on the talking-face region by a mask. Temporally, a dedicated temporal-audio cross-attention module aligns the video with the audio over time: the audio features are projected into compact temporal embeddings, which reduces the computational burden while still capturing the temporal priors within the clip. In this way the model captures the tone and speed of speech and implicitly derives motion priors for expression and head movement (see the attention sketch after this list).
- Motion-decoupled Controller: This module handles habitual head movements and subtle expression variations that are not directly tied to the audio content. It disentangles head and expression movements, allowing each to be controlled independently through explicit parameters. During training, translation and expression motion buckets are computed from the variance of face bounding boxes and facial landmarks, respectively; these buckets then modulate ResNet blocks to control head-movement and expression intensity (see the bucket sketch below). The component is user-controllable for added playability, but the parameters can also be inferred from the input audio and a reference image, eliminating the need for manual tuning.
- Time-aware Position Shift Fusion: This module enhances global inter-clip audio perception for generating long, temporally coherent videos. It addresses the limitations of existing methods that rely on motion frames or overlapping latents, which have limited receptive fields and higher computational cost. Position shift fusion progressively fuses inter-clip latent features under global audio perception using consecutively shifted, time-aware windows: the model still processes fixed-length clips, but at each denoising time step the starting position of the clip window is shifted, creating a sliding-window effect (see the sketch below). This lets the model integrate long-range audio context without extra training cost or inference time, and a circular padding strategy seamlessly handles shifted positions that exceed the sequence length.
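For concreteness, here is a minimal PyTorch sketch of what a temporal-audio cross-attention layer along the lines of the first component could look like: per-frame audio features (e.g., from Whisper-Tiny) are projected into compact temporal embeddings and attended to along the frame axis of the video latents. The class name, tensor layouts, and residual wiring are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TemporalAudioCrossAttention(nn.Module):
    """Sketch: video latents attend to per-frame audio embeddings over time."""

    def __init__(self, latent_dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        # Project audio features into compact temporal embeddings, keeping the
        # key/value sequence short and the attention cheap.
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, video_latents: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # video_latents: (B, F, H*W, C) -> each spatial location attends over frames
        # audio_feats:   (B, F, A)      -> one audio feature vector per video frame
        b, f, hw, c = video_latents.shape
        q = video_latents.permute(0, 2, 1, 3).reshape(b * hw, f, c)
        kv = self.audio_proj(audio_feats)                                  # (B, F, C)
        kv = kv.unsqueeze(1).expand(-1, hw, -1, -1).reshape(b * hw, f, c)
        out, _ = self.attn(self.norm(q), kv, kv)
        out = out.reshape(b, hw, f, c).permute(0, 2, 1, 3)
        return video_latents + out                                          # residual connection
```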
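Similarly, a rough sketch of the motion-bucket idea behind the Motion-decoupled Controller: a per-clip motion magnitude is derived from the variance of bounding-box centers or facial landmarks, quantized into a bucket index, embedded, and used to modulate a ResNet block. The bucket count, normalization constant, and FiLM-style modulation here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def motion_bucket(values: torch.Tensor, num_buckets: int = 128, max_std: float = 0.1) -> torch.Tensor:
    # values: (F, D) per-frame quantities for one clip (box centers or landmark
    # coordinates), normalized to [0, 1]; returns a scalar bucket index.
    std = values.std(dim=0).mean()                         # clip-level motion magnitude
    idx = (std / max_std * (num_buckets - 1)).clamp(0, num_buckets - 1)
    return idx.long()

class BucketModulatedResBlock(nn.Module):
    """Sketch: translation/expression bucket embeddings modulate a conv block."""

    def __init__(self, channels: int, num_buckets: int = 128):
        super().__init__()
        self.translation_emb = nn.Embedding(num_buckets, channels)
        self.expression_emb = nn.Embedding(num_buckets, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, t_bucket: torch.Tensor, e_bucket: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); t_bucket, e_bucket: (B,) indices (per-clip scalars stacked into a batch)
        scale = (self.translation_emb(t_bucket) + self.expression_emb(e_bucket))[:, :, None, None]
        return x + self.conv(x * (1 + scale))              # FiLM-style modulation with residual
```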
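Finally, a simplified sketch of the time-aware position shift fusion loop: at each denoising step the clip windows start at a shifted position (with circular wrap-around via `torch.roll`), so audio context propagates across clip boundaries without extra passes. `denoise_clip`, the shift size, and the tensor layouts are hypothetical placeholders rather than the paper's exact procedure.

```python
import torch

def position_shift_denoise(latents: torch.Tensor, audio: torch.Tensor,
                           denoise_clip, clip_len: int, num_steps: int,
                           shift: int = 3) -> torch.Tensor:
    # latents: (F, C, H, W) for the whole long video; audio: (F, A) per-frame features.
    # denoise_clip(lat_clip, aud_clip, step) is a placeholder for one denoising pass
    # over a clip-length window, returning latents of the same shape.
    total = latents.shape[0]
    offset = 0
    for step in range(num_steps):
        # Circularly shift so this step's windows start at a different position.
        lat = torch.roll(latents, shifts=-offset, dims=0)
        aud = torch.roll(audio, shifts=-offset, dims=0)
        for start in range(0, total, clip_len):
            end = min(start + clip_len, total)
            lat[start:end] = denoise_clip(lat[start:end], aud[start:end], step)
        latents = torch.roll(lat, shifts=offset, dims=0)   # undo the shift
        offset = (offset + shift) % clip_len               # advance the window start
    return latents
```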
The authors present extensive experiments to demonstrate the effectiveness of Sonic. Quantitative evaluations on HDTF and CelebV-HQ datasets show superior performance in terms of video quality (FID, FVD), lip synchronization (Sync-C, Sync-D), expression diversity (E-FID), and smoothness. Qualitative comparisons further highlight the naturalness and temporal consistency of Sonic's output. Ablation studies validate the contribution of each component, and user studies confirm the subjective improvements in lip sync, motion diversity, identity consistency, and video smoothness. The paper also demonstrates fine-grained control over lip, expression, and motion through adjustable parameters.