
SyncTalk++: Real-Time Talking Head Synthesis

Updated 8 July 2025
  • SyncTalk++ is a high-fidelity talking head synthesis system that robustly synchronizes speech with detailed facial and head motion for photorealistic output.
  • It employs a dynamic portrait renderer with Gaussian splatting to achieve multi-view-consistent, real-time rendering at up to 101 FPS.
  • The system integrates specialized audio-visual, facial animation, and head stabilization modules to ensure precise lip-sync, expression alignment, and natural head movement.

SyncTalk++ is a high-fidelity, efficient talking head synthesis system that focuses on accurate and robust synchronization between speech audio and video attributes—including lip movements, facial expressions, and head poses—while ensuring subject identity preservation and visual realism. Distinctly, SyncTalk++ introduces a Dynamic Portrait Renderer utilizing Gaussian Splatting for real-time, multi-view-consistent video generation, combined with a suite of specialized synchronization controllers and robustness modules to deliver state-of-the-art performance in both subjective and objective measures of quality and synchronization accuracy (2506.14742).

1. Dynamic Portrait Renderer with Gaussian Splatting

SyncTalk++ implements an explicit 3D dynamic portrait renderer using Gaussian Splatting to encode and animate detailed facial geometry and appearance. The system represents a face with $N$ learnable 3D Gaussian primitives, each defined by a center $\mu \in \mathbb{R}^3$, a covariance matrix $\Sigma$, an opacity $\alpha$, and spherical harmonics (SH) coefficients for view-dependent color:

$$g(x) = \exp\!\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right)$$

Covariances are decomposed into a rotation $R$ (expressed as a quaternion $r$) and a non-negative scale $S$ (vector $s$). The renderer projects Gaussians onto the image plane using the camera's transformation matrix $W$ and the local affine transformation $J$, computing:

$$\Sigma' = J W \Sigma W^\top J^\top$$

Colors at pixel locations arise from blending Gaussians in depth-sorted order using:

$$\hat{C}(r) = \sum_{i=1}^{N} c_i \, \tilde{\alpha}_i \prod_{j=1}^{i-1} \left(1-\tilde{\alpha}_j\right)$$
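
To make the rendering math concrete, the following NumPy sketch assembles a covariance from a quaternion rotation and scale, projects it with $\Sigma' = J W \Sigma W^\top J^\top$, and alpha-composites depth-sorted splats at one pixel. It is a minimal illustration, not the actual rasterizer; the helper names and toy inputs are assumptions.

```python
import numpy as np

def quat_to_rot(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance_3d(quat, scale):
    """Sigma = R S S^T R^T with R from the quaternion and S = diag(scale)."""
    R = quat_to_rot(quat)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

def project_covariance(sigma, W, J):
    """Image-plane covariance: Sigma' = J W Sigma W^T J^T."""
    return J @ W @ sigma @ W.T @ J.T

def composite(colors, alphas):
    """Front-to-back alpha blending of depth-sorted splats at one pixel:
    C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    transmittance = 1.0
    pixel = np.zeros(3)
    for c, a in zip(colors, alphas):
        pixel += c * a * transmittance
        transmittance *= (1.0 - a)
    return pixel

# Toy example: one Gaussian, identity view transform, two splats at a pixel.
sigma = covariance_3d(np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.02, 0.01, 0.01]))
sigma_img = project_covariance(sigma, W=np.eye(3), J=np.eye(2, 3))
print(composite([np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])], [0.6, 0.8]))
```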

Triplane features, interpolated from the XY, YZ, and XZ planes, are fused and fed to an MLP $\mathcal{F}_{can}$ that predicts canonical Gaussian parameters:

$$\mathcal{F}_{can}(f_{[\mu]}) = \{\mu_c, r_c, s_c, \alpha_c, SH_c\}$$

Animation is achieved via a deformation MLP $\mathcal{F}_{deform}$, which modifies these parameters according to audio-visual features $f_l$ (lips), $f_e$ (expression), and the head pose $(R, T)$:

$$\mathcal{F}_{deform}(f_{[\mu]}, f_l, f_e, R, T) = \{\Delta\mu, \Delta r, \Delta s, \Delta\alpha, \Delta SH\}$$

The offsets are added to the canonical parameters to obtain the deformed Gaussians:

$$\mathcal{G}_{deform} = \{\mu_c+\Delta\mu,\; r_c+\Delta r,\; s_c+\Delta s,\; \alpha_c+\Delta\alpha,\; SH_c+\Delta SH\}$$

The renderer is trained with pixel-wise $L_1$, perceptual, and LPIPS losses, enforcing multi-view consistency for identity preservation. This explicit and differentiable rendering backbone, in conjunction with modular pose and expression controls, enables subject-specific, robust, and high-speed synthesis, reportedly reaching 101 FPS at $512 \times 512$.
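
To make the canonical/deformation pipeline concrete, here is a hedged PyTorch sketch of the triplane lookup and a deformation MLP. The layer widths, feature dimensions, and class names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneField(nn.Module):
    """Sample features from XY, YZ, XZ planes at Gaussian centers and fuse them."""
    def __init__(self, res=128, feat_dim=32):
        super().__init__()
        self.planes = nn.Parameter(torch.randn(3, feat_dim, res, res) * 0.01)

    def forward(self, mu):                                        # mu: (N, 3) in [-1, 1]
        coords = [mu[:, [0, 1]], mu[:, [1, 2]], mu[:, [0, 2]]]    # XY, YZ, XZ
        feats = []
        for plane, uv in zip(self.planes, coords):
            grid = uv.view(1, -1, 1, 2)                           # (1, N, 1, 2)
            f = F.grid_sample(plane.unsqueeze(0), grid, align_corners=True)
            feats.append(f.squeeze(0).squeeze(-1).t())            # (N, feat_dim)
        return torch.cat(feats, dim=-1)                           # fused f_[mu]

class DeformMLP(nn.Module):
    """Predict {d_mu, d_r, d_s, d_alpha, d_SH} from fused features,
    lip features f_l, expression features f_e, and pose (R, T)."""
    def __init__(self, feat_dim=96, lip_dim=32, exp_dim=7, sh_dim=16, hidden=128):
        super().__init__()
        in_dim = feat_dim + lip_dim + exp_dim + 9 + 3             # + flattened R, T
        out_dim = 3 + 4 + 3 + 1 + sh_dim                          # d_mu, d_r, d_s, d_alpha, d_SH
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, f_mu, f_l, f_e, R, T):
        n = f_mu.shape[0]
        pose = torch.cat([R.flatten(), T]).expand(n, 12)
        x = torch.cat([f_mu, f_l.expand(n, -1), f_e.expand(n, -1), pose], dim=-1)
        return self.net(x)

# Toy usage: 1000 canonical Gaussian centers deformed by one audio frame.
mu_c = torch.rand(1000, 3) * 2 - 1
offsets = DeformMLP()(TriplaneField()(mu_c), torch.zeros(32), torch.zeros(7),
                      torch.eye(3), torch.zeros(3))
mu_deformed = mu_c + offsets[:, :3]                               # mu_c + d_mu, as in G_deform
```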

2. Face-Sync Controller

The Face-Sync Controller is a dual-component synchronization mechanism aligning lip movements precisely with the driving speech:

a. Audio-Visual Encoder:

Unlike ASR-based encoders, SyncTalk++ employs an audio encoder specifically trained for audio-visual alignment on the LRS2 dataset. Given a continuous face window $F$ and the corresponding audio $A$, synchronization is supervised with:

  • Cosine similarity: $\mathrm{sim}(F,A) = \dfrac{F \cdot A}{\|F\|_2 \, \|A\|_2}$
  • Binary cross-entropy loss:

$$L_{sync} = -\left[\, y \log\big(\mathrm{sim}(F,A)\big) + (1-y)\log\big(1-\mathrm{sim}(F,A)\big) \,\right]$$

A parallel $L_1$ reconstruction loss is imposed on $F$ decoded from the concatenated $\mathrm{Conv}(A) \oplus \mathrm{Conv}(F)$ features, promoting robust cross-modal correspondence. After pre-training, only the audio feature extractor is retained for lip-feature extraction at runtime.
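
A minimal PyTorch sketch of this synchronization supervision, assuming pre-computed face-window embeddings $F$ and audio embeddings $A$ from the encoder; the clamping constant is an implementation assumption that keeps the log terms finite.

```python
import torch
import torch.nn.functional as F

def sync_loss(face_emb, audio_emb, labels, eps=1e-7):
    """Cosine-similarity + binary cross-entropy sync loss.

    face_emb, audio_emb: (B, D) embeddings from the audio-visual encoder.
    labels: (B,) with 1 for in-sync pairs and 0 for off-sync (shifted) pairs.
    """
    sim = F.cosine_similarity(face_emb, audio_emb, dim=-1)   # F.A / (|F| |A|)
    sim = sim.clamp(eps, 1.0 - eps)                          # keep log() finite
    return F.binary_cross_entropy(sim, labels)

# Toy usage: positive (aligned) and negative (shifted-audio) pairs.
face = torch.randn(8, 512)
audio = torch.randn(8, 512)
labels = torch.tensor([1., 1., 1., 1., 0., 0., 0., 0.])
print(sync_loss(face, audio, labels))
```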

b. Facial Animation Capturer:

Facial expressions are parameterized with a 3D blendshape model using 52 coefficients $B$. For efficiency, seven semantically meaningful coefficients (e.g., brow, blink) form a reduced representation $E_{core} = \sum_j w_j B_j$. Facial features are separated into lips and expressions through a masked-attention mechanism, effectively disentangling $f_l$ (lips) from $f_e$ (expression) and reducing mutual interference during synthesis.
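
The reduction amounts to a weighted selection of a handful of blendshape channels; the sketch below uses hypothetical channel indices and weights (not the paper's) to illustrate $E_{core} = \sum_j w_j B_j$.

```python
import torch

# Full blendshape vector B: 52 coefficients for one frame (ARKit-style convention assumed).
B = torch.rand(52)

# Hypothetical indices and weights for the seven semantically meaningful channels
# (e.g., brow, blink); the actual selection and weights are not specified here.
core_idx = torch.tensor([0, 1, 2, 8, 9, 17, 19])
w = torch.tensor([1.0, 1.0, 1.0, 1.5, 1.5, 1.0, 0.5])

# Reduced representation E_core = sum_j w_j * B_j over the selected channels;
# the weighted 7-dim vector itself can serve as the expression feature f_e.
f_e = w * B[core_idx]
E_core = f_e.sum()
print(E_core, f_e)
```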

3. Head-Sync Stabilizer

The Head-Sync Stabilizer ensures stable, natural head movement by fusing multiple pose estimation and refinement techniques:

  • Head Motion Tracker:

A 3D Morphable Model (3DMM) fits 2D landmarks $L_{2D}$ to 3D keypoints using an optimal focal length $f_{opt}$, minimizing the projection error:

$$f_{opt} = \arg\min_{f_i} E_i\big(L_{2D}, L_{3D}(f_i, R_i, T_i)\big)$$

with further refinement of the head pose $(R, T)$.

  • Stable Keypoint Tracking and Bundle Adjustment:

Optical flow $\mathcal{F}(x_f, y_f, t_f) = (u_f, v_f)$ captures facial motion, and a Laplacian filter selects significant keypoints ($\mathcal{L}(\mathcal{F}(k)) > \theta$). A semantic weighting module de-emphasizes unstable, expression-driven regions (e.g., eyes, eyebrows). A two-stage optimization first aligns the projected 3D points $P_j$ with the tracked keypoints $K_j''$:

$$L_{init} = \sum_j \|P_j - K_j''\|_2$$

and then refines all parameters:

$$L_{sec} = \sum_j \|P_j(R, T) - K_j''\|_2$$

This results in temporally stable and physically plausible head motion across frames.
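
A hedged PyTorch sketch of this pipeline: a focal-length grid search followed by gradient-based refinement of the pose against the tracked keypoints. The pinhole projection helper, the axis-angle pose update, and the toy data are assumptions; the paper's optimizer and parameterization may differ.

```python
import torch

def skew(v):
    """Skew-symmetric matrix [v]_x so that exp([v]_x) is a rotation."""
    z = torch.zeros((), dtype=v.dtype)
    return torch.stack([torch.stack([z, -v[2], v[1]]),
                        torch.stack([v[2], z, -v[0]]),
                        torch.stack([-v[1], v[0], z])])

def project(points_3d, R, T, f):
    """Pinhole projection of 3D keypoints under pose (R, T) with focal length f."""
    cam = points_3d @ R.T + T
    return f * cam[:, :2] / cam[:, 2:3]

def fit_focal(points_3d, landmarks_2d, R, T, candidates):
    """f_opt = argmin_f E(L_2D, L_3D(f, R, T)) via a simple grid search."""
    errs = [(f, (project(points_3d, R, T, f) - landmarks_2d).norm(dim=1).mean())
            for f in candidates]
    return min(errs, key=lambda e: e[1])[0]

def refine_pose(points_3d, keypoints_2d, R_init, T_init, f, weights, steps=300):
    """Minimize sum_j w_j ||P_j(R, T) - K_j''||_2; weights down-weight unstable
    regions such as eyes and eyebrows."""
    rvec = torch.zeros(3, requires_grad=True)                 # axis-angle pose update
    T = T_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([rvec, T], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        R = torch.matrix_exp(skew(rvec)) @ R_init             # differentiable rotation
        residual = (project(points_3d, R, T, f) - keypoints_2d).norm(dim=1)
        loss = (weights * residual).sum()
        loss.backward()
        opt.step()
    return torch.matrix_exp(skew(rvec.detach())) @ R_init, T.detach()

# Toy usage with noisy keypoints; in practice these come from the 3DMM fit and
# the Laplacian-filtered optical-flow tracks.
pts = torch.randn(68, 3) + torch.tensor([0., 0., 5.])
R0, T0 = torch.eye(3), torch.zeros(3)
kps = project(pts, R0, T0, 1000.0) + 0.5 * torch.randn(68, 2)
f_opt = fit_focal(pts, kps, R0, T0, [800.0, 1000.0, 1200.0])
R_ref, T_ref = refine_pose(pts, kps, R0, T0, f_opt, weights=torch.ones(68))
```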

4. Robustness to Out-of-Distribution Audio

SyncTalk++ incorporates two targeted modules to address the challenges posed by OOD audio (where input speech differs in identity, affect, or prosody from training data):

a. OOD Audio Expression Generator:

An extension of EmoTalk, this module uses a Transformer-based VQ-VAE. The encoder $E$ transforms blendshape coefficients $B$ into latent codes $\mathcal{Z}_e$, which are quantized with a codebook $Z$:

$$Z_q = Q(\mathcal{Z}_e) = \arg\min_{z_k \in Z} \|\mathcal{Z}_e - z_k\|_2$$

Reconstruction through the decoder $D$ gives $\tilde{B} = D(Z_q)$. The objective comprises reconstruction and commitment losses:

$$\mathcal{L} = \|B - \tilde{B}\|^2 + \|\mathcal{Z}_e - \mathrm{sg}(Z_q)\|^2 + \beta \, \|\mathrm{sg}(\mathcal{Z}_e) - Z_q\|^2$$

This facilitates appropriately expressive faces in OOD scenarios.
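
A minimal PyTorch sketch of the vector quantization and the loss above; simple MLPs stand in for the Transformer encoder and decoder, and the codebook size, latent dimension, and $\beta$ value are assumptions.

```python
import torch
import torch.nn as nn

class BlendshapeVQVAE(nn.Module):
    """Toy VQ-VAE over blendshape coefficients B (52-dim per frame).
    MLPs stand in for the Transformer encoder/decoder used in the paper."""
    def __init__(self, dim=52, latent=64, codes=256, beta=0.25):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, dim))
        self.codebook = nn.Parameter(torch.randn(codes, latent))
        self.beta = beta

    def forward(self, B):
        z_e = self.enc(B)                                     # Z_e = E(B)
        dists = torch.cdist(z_e, self.codebook)               # ||Z_e - z_k||
        z_q = self.codebook[dists.argmin(dim=-1)]             # nearest codes Z_q
        z_st = z_e + (z_q - z_e).detach()                     # straight-through estimator
        B_hat = self.dec(z_st)                                # B~ = D(Z_q)
        # Loss terms exactly as written above: reconstruction, commitment, codebook.
        loss = ((B - B_hat) ** 2).mean() \
             + ((z_e - z_q.detach()) ** 2).mean() \
             + self.beta * ((z_e.detach() - z_q) ** 2).mean()
        return B_hat, loss

# Toy usage: a batch of 16 blendshape frames.
model = BlendshapeVQVAE()
B = torch.rand(16, 52)
B_hat, loss = model(B)
loss.backward()
```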

b. OOD Audio Torso Restorer:

A lightweight U-Net inpainting network repairs head–torso boundary inconsistencies—common when jaw or neck pose shifts. Random mask expansion simulates missing regions during training; the model inpaints the missing torso as:

$$\hat{F}_{source} = \mathcal{I}\big(M F_{source},\; (1 - M - \delta_{rand})\, F_{source};\; \theta\big)$$

using $L_1$ and LPIPS losses. At inference, a fixed boundary region (e.g., 15 pixels) is inpainted to ensure seamless blending.
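
The following sketch illustrates the training-time mask handling implied by the formula above: the head mask $M$ is randomly dilated to create the missing band $\delta_{rand}$, and a small convolutional stand-in (not the actual U-Net) predicts the frame from the head part and the hole-punched torso part. The mask source, shapes, and network are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def expand_mask(mask, max_pixels=15):
    """Randomly dilate the head mask M by up to `max_pixels` to create the
    missing boundary band delta_rand used during training."""
    k = int(torch.randint(1, max_pixels + 1, ()).item()) * 2 + 1
    return (F.max_pool2d(mask, kernel_size=k, stride=1, padding=k // 2) > 0).float()

class TinyInpainter(nn.Module):
    """Stand-in for the lightweight U-Net I(. ; theta); takes the head part and
    the hole-punched torso part and predicts the full frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 3, 3, padding=1))

    def forward(self, head_part, torso_part):
        return self.net(torch.cat([head_part, torso_part], dim=1))

# Training-style usage: F_source is a frame, M a head mask from face parsing (assumed).
frame = torch.rand(1, 3, 256, 256)
M = torch.zeros(1, 1, 256, 256)
M[:, :, 40:160, 80:180] = 1.0
delta = expand_mask(M) - M                              # delta_rand: random seam band
head_part = frame * M                                   # M * F_source
torso_part = frame * (1.0 - M - delta)                  # (1 - M - delta_rand) * F_source
pred = TinyInpainter()(head_part, torso_part)           # F_hat = I(...; theta)
loss = F.l1_loss(pred, frame)                           # L1 term (LPIPS omitted here)
```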

5. Quantitative Performance and Practical Evaluation

SyncTalk++ demonstrates simultaneous advances in both quality and speed:

  • Rendering: Real-time synthesis at up to 101 FPS ($512 \times 512$), substantially faster than prior 3D approaches.
  • Image Quality: Significant improvements in PSNR, LPIPS, MS-SSIM, and FID compared to both 2D and prior 3D/NeRF methods.
  • Lip Synchronization: State-of-the-art landmark distance (LMD) and Lip Sync Error Confidence (LSE-C) scores on both in-distribution and OOD audio.
  • Training Efficiency: Fast subject adaptation; new subject training requires only about 1.5 hours.
  • User Studies: In comprehensive perceptual studies with more than 40 participants, SyncTalk++ was consistently top-rated for lip-sync accuracy, expression-sync, pose-sync, image quality, and overall video realism.
  • Ablation Experiments: Each architecture module (audio-visual encoder, 3D blendshapes, head stabilizer, OOD modules) was shown essential for optimal performance, with observed quality and synchronization degradation upon removal.

6. Application Domains and Significance

SyncTalk++'s combination of lip-audio alignment, blendshaped facial expressions, head pose stability, high fidelity, and real-time performance addresses critical requirements across:

  • Digital avatars and virtual assistants requiring realism and interactive responsiveness
  • Film/animation production seeking reduced post-production effort in dubbing or lip-sync
  • Real-time telepresence and video communication with enhanced expression and synchronization
  • Robust media synthesis in settings with unknown or cross-domain audio sources

Its generalization to OOD input broadens applicability, while the explicit, modular design supports future integration of improved speech feature representation, 3D expression modeling, and further rendering optimizations.


SyncTalk++ constitutes a system-level advance that merges explicit 3D dynamic rendering and specialized synchronization mechanisms, setting new standards in efficient and synchronized talking head video synthesis (2506.14742).
