
Sonic: Shifting Focus to Global Audio Perception in Portrait Animation (2411.16331v3)

Published 25 Nov 2024 in cs.MM, cs.GR, cs.SD, eess.AS, and cs.CV

Abstract: The study of talking face generation mainly explores the intricacies of synchronizing facial movements and crafting visually appealing, temporally coherent animations. However, due to the limited exploration of global audio perception, current approaches predominantly employ auxiliary visual and spatial knowledge to stabilize the movements, which often degrades naturalness and introduces temporal inconsistencies. Considering the essence of audio-driven animation, the audio signal serves as the ideal and unique prior for adjusting facial expressions and lip movements, without resorting to interference from any visual signals. Based on this motivation, we propose a novel paradigm, dubbed Sonic, to shift focus to the exploration of global audio perception. To effectively leverage global audio knowledge, we disentangle it into intra- and inter-clip audio perception and collaborate with both aspects to enhance overall perception. For intra-clip audio perception: 1) Context-enhanced audio learning, in which long-range intra-clip temporal audio knowledge is extracted to provide facial expression and lip motion priors implicitly expressed as the tone and speed of speech; 2) Motion-decoupled controller, in which head motion and expression movement are disentangled and independently controlled by intra-clip audio. Most importantly, for inter-clip audio perception, as a bridge connecting intra-clips to achieve global perception: Time-aware position shift fusion, in which global inter-clip audio information is considered and fused for long-audio inference through consecutively time-aware shifted windows. Extensive experiments demonstrate that the novel audio-driven paradigm outperforms existing SOTA methodologies in terms of video quality, temporal consistency, lip synchronization precision, and motion diversity.

Summary

  • The paper introduces Sonic, a novel framework for audio-driven talking face generation that leverages global audio perception as the primary driver, moving away from auxiliary visual cues.
  • Sonic combines a Context-enhanced Audio Learning module that extracts multi-scale, long-range audio features, a Motion-decoupled Controller that disentangles and independently controls head and expression movements, and a Time-aware Position Shift Fusion scheme that bridges clips for coherent long-audio inference.
  • Quantitative and qualitative experiments demonstrate Sonic's superior performance in video quality, lip synchronization, expression diversity, and temporal consistency compared to existing methods.

The paper "Sonic: Shifting Focus to Global Audio Perception in Portrait Animation" introduces a novel approach to audio-driven talking face generation. The core idea is to leverage global audio perception as the primary driver for facial animation, moving away from reliance on auxiliary visual cues or spatial knowledge that often compromise naturalness and temporal consistency. The authors argue that audio, as a unique and global signal, provides inherent priors for facial expressions and lip movements, and should be the central element in guiding animation.

The proposed framework, named Sonic, achieves this through three key components:

  1. Context-enhanced Audio Learning: This module extracts both short-range and long-range temporal audio knowledge from the input audio clip. It uses a pre-trained audio feature extractor (Whisper-Tiny) to obtain multi-scale audio features, which then guide both the spatial and temporal aspects of the generated video. Spatially, audio features are injected into spatial cross-attention layers and focused on the talking-face region using a mask. Temporally, a custom temporal-audio cross-attention module aligns the video with the audio over time by projecting audio features into temporal embeddings, reducing computational burden while capturing temporal priors within the clip. In effect, the tone and speed of speech are captured to implicitly derive expression and head-motion priors (a minimal cross-attention sketch follows this list).
  2. Motion-decoupled Controller: This module addresses habitual head movements and subtle expression variations that are not directly tied to the audio content. It disentangles head and expression movements, allowing each to be controlled independently through explicit parameters. During training, translation and expression motion buckets are computed from the variance of face bounding boxes and facial landmarks, respectively; these buckets then modulate ResNet blocks to control head-movement and expression intensity. Importantly, the component is user-controllable for added playability, but the buckets can also be inferred from the input audio and a reference image, eliminating the need for manual parameter tuning (a hedged sketch of the bucket computation follows this list).
  3. Time-aware Position Shift Fusion: This module enhances global inter-clip audio perception for generating long, temporally coherent videos. It addresses the limitations of existing methods that rely on motion frames or overlapping latents, which have limited receptive fields and added computational complexity. The fusion progressively mixes inter-clip latent features under global audio perception using consecutively shifted, time-aware windows: audio clips are processed in parallel, but at each time step the starting position of the clip window is shifted, creating a sliding-window effect. This lets the model integrate long-range audio context without extra training cost or inference time, and a circular padding strategy seamlessly handles shifted positions that exceed the sequence length (a minimal window-schedule sketch follows this list).
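
To make items 1 and 2 concrete, here are two short, hedged sketches. First, a minimal temporal-audio cross-attention layer in PyTorch: the class name, tensor shapes, and residual wiring are assumptions about how such a module could look, not the authors' released code.

```python
# Hypothetical temporal-audio cross-attention (illustrative, not the paper's code):
# video latent tokens attend over projected audio embeddings along the time axis.
import torch
import torch.nn as nn

class TemporalAudioCrossAttention(nn.Module):
    def __init__(self, dim, audio_dim, heads=8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)  # audio features -> temporal embeddings
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_feats):
        # video_tokens: (B*H*W, T, dim)       latents rearranged so attention runs over time
        # audio_feats:  (B*H*W, T, audio_dim) per-frame audio features (e.g. from Whisper-Tiny)
        audio = self.audio_proj(audio_feats)
        out, _ = self.attn(query=video_tokens, key=audio, value=audio)
        return video_tokens + out  # residual keeps the visual pathway intact
```

Second, for the motion-decoupled controller, a sketch of how the translation and expression motion buckets could be computed; the function name, bucket count, and normalization constants are illustrative assumptions.

```python
# Hypothetical motion-bucket computation (illustrative assumptions, not the
# authors' implementation).
import numpy as np

def motion_buckets(bboxes, landmarks, num_buckets=16,
                   trans_scale=0.05, expr_scale=0.01):
    """Quantize head-translation and expression variability into discrete buckets.

    bboxes:    (T, 4) per-frame face boxes (x1, y1, x2, y2), normalized to [0, 1].
    landmarks: (T, K, 2) per-frame facial landmarks, normalized to [0, 1].
    """
    # Head translation: variance of the box center across the clip.
    centers = np.stack([(bboxes[:, 0] + bboxes[:, 2]) / 2,
                        (bboxes[:, 1] + bboxes[:, 3]) / 2], axis=-1)  # (T, 2)
    trans_var = centers.var(axis=0).mean()

    # Expression: landmark variance after removing per-frame global translation,
    # so head motion does not leak into the expression statistic.
    lm_centered = landmarks - landmarks.mean(axis=1, keepdims=True)   # (T, K, 2)
    expr_var = lm_centered.var(axis=0).mean()

    # Map the variances to integer bucket ids in [0, num_buckets - 1].
    trans_bucket = int(np.clip(trans_var / trans_scale, 0.0, 1.0) * (num_buckets - 1))
    expr_bucket = int(np.clip(expr_var / expr_scale, 0.0, 1.0) * (num_buckets - 1))
    return trans_bucket, expr_bucket
```

At training time such bucket ids would be embedded to modulate the ResNet blocks; at inference they can either be set by the user for explicit control or inferred from the audio and reference image, matching the two modes described above.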

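The time-aware position shift in item 3 can likewise be illustrated as a window schedule. The sketch below is a simplified assumption about the mechanism: at each inference step the start of every clip-length window is displaced by a fixed shift, and indices past the end of the sequence wrap around (circular padding), so clip boundaries move across steps and long-range audio context is progressively mixed in.

```python
# Minimal sketch of a time-aware position shift schedule (illustrative only;
# the shift size and window layout are assumptions, not the paper's settings).

def shifted_windows(seq_len, clip_len, step_idx, shift=2):
    """Return clip-length index windows over `seq_len` latent frames, with the
    start offset by `step_idx * shift` and wrapped circularly so every window
    stays full length."""
    offset = (step_idx * shift) % seq_len
    return [[(start + i) % seq_len for i in range(clip_len)]
            for start in range(offset, offset + seq_len, clip_len)]

# Example: 40 latent frames, clips of 8, at two consecutive steps.
print(shifted_windows(seq_len=40, clip_len=8, step_idx=0))  # windows start at 0, 8, 16, ...
print(shifted_windows(seq_len=40, clip_len=8, step_idx=1))  # windows start at 2, 10, ...;
# the last window wraps to the beginning, so clip boundaries never stay fixed and
# information can propagate across the whole sequence over successive steps.
```

Because the shifted windows only re-index existing latents and audio features, no extra parameters or passes are introduced, consistent with the claim of no additional training cost or inference time.
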
The authors present extensive experiments to demonstrate the effectiveness of Sonic. Quantitative evaluations on HDTF and CelebV-HQ datasets show superior performance in terms of video quality (FID, FVD), lip synchronization (Sync-C, Sync-D), expression diversity (E-FID), and smoothness. Qualitative comparisons further highlight the naturalness and temporal consistency of Sonic's output. Ablation studies validate the contribution of each component, and user studies confirm the subjective improvements in lip sync, motion diversity, identity consistency, and video smoothness. The paper also demonstrates fine-grained control over lip, expression, and motion through adjustable parameters.
