SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting (2506.14742v1)

Published 17 Jun 2025 in cs.CV

Abstract: Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic results. To address the critical issue of synchronization, identified as the "devil" in creating realistic talking heads, we introduce SyncTalk++, which features a Dynamic Portrait Renderer with Gaussian Splatting to ensure consistent subject identity preservation and a Face-Sync Controller that aligns lip movements with speech while innovatively using a 3D facial blendshape model to reconstruct accurate facial expressions. To ensure natural head movements, we propose a Head-Sync Stabilizer, which optimizes head poses for greater stability. Additionally, SyncTalk++ enhances robustness to out-of-distribution (OOD) audio by incorporating an Expression Generator and a Torso Restorer, which generate speech-matched facial expressions and seamless torso regions. Our approach maintains consistency and continuity in visual details across frames and significantly improves rendering speed and quality, achieving up to 101 frames per second. Extensive experiments and user studies demonstrate that SyncTalk++ outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk++.

Summary

  • The paper introduces SyncTalk++, a novel system that integrates 3D Gaussian Splatting with specialized audio-visual encoders to achieve precise lip sync, realistic facial expressions, and stable head poses.
  • It employs modular innovations—such as the Face-Sync Controller, Head-Sync Stabilizer, and Dynamic Portrait Renderer—to disentangle facial features and optimize synchronization across dynamic regions.
  • Experiments show SyncTalk++ outperforms previous methods with a lower LPIPS error and real-time performance (101 FPS), highlighting its practical efficiency and potential in digital media applications.

SyncTalk++ (2506.14742) addresses the significant challenge of generating high-fidelity, synchronized talking head videos from audio input. Achieving a lifelike synthetic talking head requires precise coordination of subject identity, lip movements, facial expressions, and head poses. Existing methods, whether based on 2D generation (GANs, Diffusion Models) or 3D reconstruction (NeRF, earlier Gaussian Splatting methods), struggle to maintain consistency and synchronization across these factors, often leading to artifacts, identity shifts, inaccurate lip movements, and unstable poses, especially with limited training data or out-of-distribution (OOD) audio.

The paper introduces SyncTalk++, a system designed to overcome these limitations by focusing on synchronization and leveraging the efficiency and fidelity of 3D Gaussian Splatting. SyncTalk++ consists of several key components:

  1. Face-Sync Controller: This module is responsible for ensuring accurate lip synchronization and controllable facial expressions.
    • Audio-Visual Encoder: Unlike methods using ASR-trained audio features, SyncTalk++ employs an audio encoder pre-trained on an audio-visual synchronization dataset (LRS2 [afouras2018deep]). This training, supervised by a lip synchronization discriminator, ensures the extracted audio features are specifically relevant to lip movements, improving synchronization accuracy.
    • Facial Animation Capturer: To achieve synchronized and realistic facial expressions, the method utilizes a 3D facial blendshape model with 52 semantically meaningful coefficients [peng2023emotalk]. During training, a network captures facial expressions as features based on these coefficients, focusing on seven core coefficients (related to eyebrows and eyes) that are highly correlated with expressions and independent of lip movements.
    • Facial-Aware Masked-Attention: To prevent interference between lip and expression features, the face is divided into upper and lower regions at the nose-tip landmark. Masks are applied within the attention mechanism ($V_{\text{lip}} = V \odot M_{\text{lip}}$, $V_{\text{exp}} = V \odot M_{\text{exp}}$), so each branch attends only to its own region, disentangling the two feature sets (a minimal masking sketch appears after this list).
  2. Head-Sync Stabilizer: This module maintains stable head poses, preventing jitter and separation artifacts common in dynamic talking head synthesis.
    • Head Motion Tracker: Sparse 2D facial landmarks are extracted and used with a 3D Morphable Model (BFM [paysan20093d]) to estimate 3D keypoints. The system optimizes focal length, rotation ($R$), and translation ($T$) by minimizing the reprojection error between the 3DMM landmarks and the detected 2D landmarks (a minimal pose-fitting sketch appears after this list).
    • Stable Head Points Tracker: Optical flow estimation [yao2022dfa] tracks facial keypoints. A Semantic Weighting module is introduced to assign lower weights to points in highly dynamic regions like eyebrows and eyes, which are prone to erratic movements, improving the stability and accuracy of the estimated head pose parameters ($R$, $T$).
    • Bundle Adjustment: A two-stage optimization refines the 3D keypoints and head pose estimations by minimizing alignment errors, resulting in smooth and stable pose parameters.
  3. Dynamic Portrait Renderer: This is the core rendering engine built upon 3D Gaussian Splatting [kerbl20233d].
    • Triplane Gaussian Representation: To model canonical 3D Gaussians representing the average head shape and ensure multi-view consistency, the method uses three orthogonal 2D feature grids (XY, YZ, XZ planes) [chan2022efficient, fridovich2023k, hu2023tri]. Point coordinates are interpolated across these planes to derive fused geometric features ($f_{\mu}$), which are then projected by an MLP ($\mathcal{F}_{\text{can}}$) to predict canonical Gaussian attributes ($\mu_c, r_c, s_c, \alpha_c, SH_c$).
    • Deformation Network: MLPs ($\mathcal{F}_{\text{deform}}$) predict offsets ($\triangle \mu, \triangle r, \triangle s, \triangle \alpha, \triangle SH$) for each Gaussian attribute based on the fused geometric feature ($f_{\mu}$), the lip feature ($f_l$), the expression feature ($f_e$), and the head pose ($R$, $T$). These offsets are added to the canonical attributes to obtain the deformed 3D Gaussians ($\mathcal{G}_{\text{deform}}$); a combined tri-plane and deformation sketch appears after this list.
    • Optimization and Training: A two-stage training process is used: first, optimizing canonical Gaussian fields using pixel-level, perceptual, and LPIPS losses; then, optimizing the full deformable Gaussian fields, increasing the weight on LPIPS loss.
    • Portrait-Sync Generator: To integrate the rendered facial region with the original high-resolution background and preserve fine details such as hair, the 3DGS-rendered face is smoothly blended back into the original frame, likely via Gaussian blurring at the boundary and coordinate-based placement (a minimal blending sketch appears after this list).
  4. OOD Audio Expression Generator: This module addresses the mismatch between facial expressions and speech content when using OOD audio (e.g., from different speakers or TTS). It builds on EmoTalk [peng2023emotalk] to generate speech-matched expressions. To improve generalization to OOD blendshape coefficients from cross-identity sources, a Transformer-based VQ-VAE [van2017neural] is pre-trained: it encodes blendshape coefficients into a discrete codebook so that, during decoding, the generated coefficients are adapted to the target character's facial features, mitigating mapping ambiguity and artifacts (a VQ-VAE quantization sketch appears after this list).
  5. OOD Audio Torso Restorer: This module repairs potential gaps and inconsistencies between the generated head and the original torso, which can arise with OOD audio due to jaw-position discrepancies. It employs a lightweight U-Net-based inpainting model. Training simulates these gaps by removing expanded mask areas from source frames, teaching the network to fill the missing regions and seamlessly blend the generated head with the torso (a gap-simulation sketch appears after this list).
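
As referenced in the Face-Sync Controller item, the region-masking idea can be illustrated with a minimal sketch. The plain dot-product attention below is a simplified stand-in for the paper's Facial-Aware Masked-Attention, and all tensor shapes, the assumption that the lip and expression features share the visual channel dimension, and the function name `facial_aware_masked_attention` are illustrative, not the paper's implementation.

```python
# Minimal sketch of facial-aware masked attention (illustrative shapes/names).
# Binary masks derived from the nose-tip landmark restrict the lip (audio) and
# expression features to the lower and upper face regions, respectively.
import torch

def facial_aware_masked_attention(V, M_lip, M_exp, f_lip, f_exp):
    """V:     visual feature map, shape (B, C, H, W)
       M_lip: lower-face mask,    shape (B, 1, H, W), 1 below the nose tip
       M_exp: upper-face mask,    shape (B, 1, H, W), 1 above the nose tip
       f_lip: lip (audio) feature,     shape (B, C)
       f_exp: expression feature,      shape (B, C)"""
    V_lip = V * M_lip   # V ⊙ M_lip: lip branch only sees the lower face
    V_exp = V * M_exp   # V ⊙ M_exp: expression branch only sees the upper face

    B, C, H, W = V.shape
    # Dot-product attention between masked region features and driving signals
    # (a stand-in for the paper's attention module).
    attn_lip = torch.softmax((V_lip.flatten(2).transpose(1, 2) @ f_lip.unsqueeze(2)) / C**0.5, dim=1)
    attn_exp = torch.softmax((V_exp.flatten(2).transpose(1, 2) @ f_exp.unsqueeze(2)) / C**0.5, dim=1)

    # Inject each driving feature only into its own region.
    out = V.flatten(2) + f_lip.unsqueeze(2) @ attn_lip.transpose(1, 2) \
                       + f_exp.unsqueeze(2) @ attn_exp.transpose(1, 2)
    return out.view(B, C, H, W)
```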
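
The Head Motion Tracker's pose fitting can be sketched as a small reprojection-error minimization. The axis-angle parameterization, the Adam optimizer, and the names `rodrigues` and `fit_head_pose` are illustrative choices rather than the paper's implementation, which further refines the estimates with optical-flow tracking and bundle adjustment.

```python
# Minimal sketch of head-pose fitting: optimize focal length, rotation, and
# translation so that projected 3D model landmarks match detected 2D landmarks.
import torch

def rodrigues(rvec):
    """Axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = rvec.norm() + 1e-8
    k = rvec / theta
    zero = torch.zeros(())
    K = torch.stack([zero, -k[2], k[1],
                     k[2], zero, -k[0],
                     -k[1], k[0], zero]).view(3, 3)
    return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def fit_head_pose(pts3d, pts2d, iters=500):
    """pts3d: (N, 3) 3DMM landmarks; pts2d: (N, 2) detected landmarks in pixels,
    centered on the principal point. Returns axis-angle rotation, translation, focal."""
    rvec = torch.zeros(3, requires_grad=True)
    tvec = torch.tensor([0.0, 0.0, 5.0], requires_grad=True)
    focal = torch.tensor(1000.0, requires_grad=True)
    opt = torch.optim.Adam([rvec, tvec, focal], lr=1e-2)
    for _ in range(iters):
        opt.zero_grad()
        R = rodrigues(rvec)
        cam = pts3d @ R.T + tvec                  # rigid transform into camera space
        proj = focal * cam[:, :2] / cam[:, 2:3]   # pinhole projection
        loss = ((proj - pts2d) ** 2).mean()       # landmark reprojection error
        loss.backward()
        opt.step()
    return rvec.detach(), tvec.detach(), focal.detach()
```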
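
The Dynamic Portrait Renderer's representation can be summarized in one compact sketch: a tri-plane feature lookup per Gaussian, an MLP standing in for $\mathcal{F}_{\text{can}}$ that predicts canonical attributes, and an MLP standing in for $\mathcal{F}_{\text{deform}}$ that predicts offsets conditioned on the lip feature, expression feature, and pose. Plane resolution, feature dimensions, and layer widths below are assumptions for illustration, not the paper's configuration.

```python
# Illustrative sketch of the tri-plane Gaussian representation and deformation.
import torch
import torch.nn.functional as F

class TriplaneEncoder(torch.nn.Module):
    def __init__(self, res=128, ch=32):
        super().__init__()
        # Three learnable orthogonal feature planes: XY, YZ, XZ.
        self.planes = torch.nn.Parameter(torch.randn(3, ch, res, res) * 0.01)

    def forward(self, xyz):
        """xyz: (N, 3) Gaussian centers normalized to [-1, 1]. Returns fused f_mu: (N, ch)."""
        coords = torch.stack([xyz[:, [0, 1]],      # XY plane
                              xyz[:, [1, 2]],      # YZ plane
                              xyz[:, [0, 2]]], 0)  # XZ plane -> (3, N, 2)
        feats = F.grid_sample(self.planes, coords.unsqueeze(2),
                              align_corners=True)  # (3, ch, N, 1)
        return feats.squeeze(-1).sum(0).T          # sum-fuse the three planes

ATTR_DIMS = [3, 4, 3, 1, 48]  # position, rotation (quaternion), scale, opacity, SH coefficients

class CanonicalHead(torch.nn.Module):
    """Stand-in for F_can: fused feature -> canonical Gaussian attributes."""
    def __init__(self, ch=32):
        super().__init__()
        self.mlp = torch.nn.Sequential(torch.nn.Linear(ch, 128), torch.nn.ReLU(),
                                       torch.nn.Linear(128, sum(ATTR_DIMS)))

    def forward(self, f_mu):
        return torch.split(self.mlp(f_mu), ATTR_DIMS, dim=-1)  # mu_c, r_c, s_c, alpha_c, SH_c

class DeformHead(torch.nn.Module):
    """Stand-in for F_deform: (f_mu, f_l, f_e, pose) -> per-attribute offsets."""
    def __init__(self, ch=32, lip_dim=64, exp_dim=7, pose_dim=12):
        super().__init__()
        self.mlp = torch.nn.Sequential(torch.nn.Linear(ch + lip_dim + exp_dim + pose_dim, 128),
                                       torch.nn.ReLU(),
                                       torch.nn.Linear(128, sum(ATTR_DIMS)))

    def forward(self, f_mu, f_l, f_e, pose):
        """f_l, f_e, pose: frame-level features of shape (1, dim); f_mu: (N, ch)."""
        n = f_mu.shape[0]
        cond = torch.cat([f_mu, f_l.expand(n, -1), f_e.expand(n, -1), pose.expand(n, -1)], -1)
        return torch.split(self.mlp(cond), ATTR_DIMS, dim=-1)  # Δmu, Δr, Δs, Δalpha, ΔSH

# Deformed Gaussians G_deform = canonical attributes + predicted offsets (per attribute).
```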
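
The Portrait-Sync Generator's compositing step can be approximated by feathered alpha blending: blur the face mask and paste the rendered face back at its original coordinates. The OpenCV helper below, including its kernel size and the name `blend_face`, is a hypothetical illustration of that idea rather than the paper's procedure.

```python
# Minimal sketch of blending the 3DGS-rendered face into the original frame.
import cv2
import numpy as np

def blend_face(original, rendered_face, face_mask, top_left):
    """original: (H, W, 3) uint8 frame; rendered_face: (h, w, 3) uint8 3DGS output;
    face_mask: (h, w) float32 in [0, 1]; top_left: (y, x) paste coordinates."""
    y, x = top_left
    h, w = rendered_face.shape[:2]
    soft = cv2.GaussianBlur(face_mask, (21, 21), 0)[..., None]  # feathered alpha
    out = original.astype(np.float32).copy()
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = soft * rendered_face + (1 - soft) * region
    return out.astype(np.uint8)
```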
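
The discrete codebook used by the OOD Audio Expression Generator follows the standard VQ-VAE quantization step, sketched below with a straight-through gradient estimator. The codebook size, latent dimension, and class name are illustrative; only the generic VQ-VAE mechanism is shown, not the paper's Transformer encoder/decoder.

```python
# Minimal sketch of VQ-VAE vector quantization for encoded blendshape frames.
import torch

class VectorQuantizer(torch.nn.Module):
    def __init__(self, num_codes=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = torch.nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z_e):
        """z_e: (B, T, dim) encoder outputs for a sequence of blendshape frames."""
        flat = z_e.reshape(-1, z_e.shape[-1])                 # (B*T, dim)
        d = torch.cdist(flat, self.codebook.weight)           # distance to every code
        idx = d.argmin(dim=1)                                 # nearest code per vector
        z_q = self.codebook(idx).view_as(z_e)                 # quantized latents
        # Commitment + codebook loss terms from the VQ-VAE objective.
        loss = self.beta * ((z_q.detach() - z_e) ** 2).mean() + ((z_q - z_e.detach()) ** 2).mean()
        # Straight-through estimator: copy gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx.view(z_e.shape[:-1]), loss
```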
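
The Torso Restorer's training-time gap simulation can be illustrated by dilating the head mask and erasing the resulting boundary band, which the inpainting U-Net then learns to fill. The helper below and its dilation radius are assumptions for illustration.

```python
# Minimal sketch of simulating head-torso gaps for inpainting training.
import cv2
import numpy as np

def simulate_gap(frame, head_mask, dilate_px=15):
    """frame: (H, W, 3) uint8; head_mask: (H, W) uint8 in {0, 255}."""
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    dilated = cv2.dilate(head_mask, kernel)
    gap = cv2.subtract(dilated, head_mask)   # band around the head boundary
    corrupted = frame.copy()
    corrupted[gap > 0] = 0                   # erase the band to mimic a jaw-position gap
    return corrupted, gap                    # network input and region to restore
```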

Experiments comparing SyncTalk++ with state-of-the-art 2D and 3D methods demonstrate its superiority. Quantitatively, SyncTalk++ achieves state-of-the-art results across image quality metrics (PSNR, LPIPS, MS-SSIM, FID, NIQE, BRISQUE, HyperIQA) and synchronization metrics (LMD, AUE, LSE-C/D), including a significantly lower LPIPS error (0.0201) than previous methods. For lip synchronization with OOD audio, SyncTalk++ achieves a higher LSE-C (6.4633/6.0733) and lower LSE-D (8.0808/8.0217) than other methods. The use of Gaussian Splatting yields high efficiency: training takes only 1.5 hours per subject on an NVIDIA RTX 4090 GPU, and rendering runs in real time at 101 FPS at 512x512 resolution. Qualitative comparisons show SyncTalk++ producing more precise facial details, accurate lip shapes, proper blinking and eyebrow movements, and stable head poses without head-torso separation, even with challenging long hair. A user study further validates these findings, with SyncTalk++ receiving the highest perceptual scores for lip-sync accuracy, expression-sync accuracy, pose-sync accuracy, image quality, and video realness. Ablation studies confirm the crucial contribution of each proposed component to the overall performance.
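
For context on the image-quality numbers, LPIPS scores such as the 0.0201 reported above are conventionally computed with the reference `lpips` package on RGB tensors scaled to [-1, 1]. The snippet below is a generic usage sketch with placeholder data, not the paper's evaluation script.

```python
# Generic LPIPS usage sketch (placeholder images, not the paper's pipeline).
import lpips
import torch

loss_fn = lpips.LPIPS(net='alex')                # AlexNet-backbone LPIPS, the common default
img_pred = torch.rand(1, 3, 512, 512) * 2 - 1    # rendered frame, placeholder data in [-1, 1]
img_gt = torch.rand(1, 3, 512, 512) * 2 - 1      # ground-truth frame, placeholder data
with torch.no_grad():
    d = loss_fn(img_pred, img_gt)                # lower is better
print(d.item())
```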

The paper also includes an ethical consideration, acknowledging the potential for misuse in creating deepfakes. The authors suggest mitigation strategies such as improving deepfake detection algorithms (offering to share their work), protecting real training videos, promoting transparency and consent in the use of synthetic media, and establishing legal regulations.

In conclusion, SyncTalk++ presents a robust and efficient solution for synchronized talking head synthesis by effectively integrating improved synchronization mechanisms for lip movements, facial expressions, and head poses with the power of 3D Gaussian Splatting for high-fidelity and real-time rendering. The inclusion of OOD audio handling capabilities further enhances its practicality for real-world applications in digital assistants, VR, and filmmaking.