
SyncTalk++: Real-Time Talking Head Synthesis

Updated 8 July 2025
  • SyncTalk++ is a high-fidelity talking head synthesis system that robustly synchronizes speech with detailed facial and head motion for photorealistic output.
  • It employs a dynamic portrait renderer with Gaussian splatting to achieve multi-view-consistent, real-time rendering at up to 101 FPS.
  • The system integrates specialized audio-visual, facial animation, and head stabilization modules to ensure precise lip-sync, expression alignment, and natural head movement.

SyncTalk++ is a high-fidelity, efficient talking head synthesis system that focuses on accurate and robust synchronization between speech audio and video attributes—including lip movements, facial expressions, and head poses—while ensuring subject identity preservation and visual realism. Distinctly, SyncTalk++ introduces a Dynamic Portrait Renderer utilizing Gaussian Splatting for real-time, multi-view-consistent video generation, combined with a suite of specialized synchronization controllers and robustness modules to deliver state-of-the-art performance in both subjective and objective measures of quality and synchronization accuracy (2506.14742).

1. Dynamic Portrait Renderer with Gaussian Splatting

SyncTalk++ implements an explicit 3D dynamic portrait renderer using Gaussian Splatting to encode and animate detailed facial geometry and appearance. The system represents a face with $N$ learnable 3D Gaussian primitives, each defined by a center $\mu \in \mathbb{R}^3$, a covariance matrix $\Sigma$, an opacity $\alpha$, and spherical harmonics (SH) coefficients for view-dependent color:

$$g(x) = \exp\!\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right)$$

Covariances are decomposed into a rotation $R$ (expressed as a quaternion $r$) and a non-negative scale $S$ (vector $s$). The renderer projects Gaussians onto the image plane using the camera's transformation matrix $W$ and the local affine transformation $J$, computing:

$$\Sigma' = J W \Sigma W^\top J^\top$$

Colors at pixel locations arise from blending Gaussians in depth-sorted order using:

$$\hat{C}(r) = \sum_{i=1}^{N} c_i \, \tilde{\alpha}_i \prod_{j=1}^{i-1} \left(1-\tilde{\alpha}_j\right)$$
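
To make the rendering math concrete, the following NumPy sketch assembles a covariance from a quaternion rotation and scale, projects it with $\Sigma' = J W \Sigma W^\top J^\top$, and alpha-composites depth-sorted splats at one pixel. It is a minimal illustration, not the actual rasterizer; the helper names and toy inputs are assumptions.

```python
import numpy as np

def quat_to_rot(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance_3d(quat, scale):
    """Sigma = R S S^T R^T with R from the quaternion and S = diag(scale)."""
    R = quat_to_rot(quat)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

def project_covariance(sigma, W, J):
    """Image-plane covariance: Sigma' = J W Sigma W^T J^T."""
    return J @ W @ sigma @ W.T @ J.T

def composite(colors, alphas):
    """Front-to-back alpha blending of depth-sorted splats at one pixel:
    C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    transmittance = 1.0
    pixel = np.zeros(3)
    for c, a in zip(colors, alphas):
        pixel += c * a * transmittance
        transmittance *= (1.0 - a)
    return pixel

# Toy example: one Gaussian, identity view transform, two splats at a pixel.
sigma = covariance_3d(np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.02, 0.01, 0.01]))
sigma_img = project_covariance(sigma, W=np.eye(3), J=np.eye(2, 3))
print(composite([np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])], [0.6, 0.8]))
```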

Triplane features, interpolated from the XY, YZ, and XZ planes, are fused and fed to an MLP $\mathcal{F}_{can}$ that predicts canonical Gaussian parameters:

$$\mathcal{F}_{can}(f_{[\mu]}) = \{\mu_c, r_c, s_c, \alpha_c, SH_c\}$$

Animation is achieved via a deformation MLP $\mathcal{F}_{deform}$, which modifies these parameters according to audio-visual features $f_l$ (lips), $f_e$ (expression), and the head pose $(R, T)$:

$$\mathcal{F}_{deform}(f_{[\mu]}, f_l, f_e, R, T) = \{\Delta\mu, \Delta r, \Delta s, \Delta\alpha, \Delta SH\}$$

The offsets are added to the canonical parameters to obtain the deformed Gaussians:

$$\mathcal{G}_{deform} = \{\mu_c+\Delta\mu,\; r_c+\Delta r,\; s_c+\Delta s,\; \alpha_c+\Delta\alpha,\; SH_c+\Delta SH\}$$

The renderer is trained with pixel-wise $L_1$, perceptual, and LPIPS losses, enforcing multi-view consistency for identity preservation. This explicit and differentiable rendering backbone, in conjunction with modular pose and expression controls, enables subject-specific, robust, and high-speed synthesis, reportedly reaching 101 FPS at $512 \times 512$.
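
To make the canonical/deformation pipeline concrete, here is a hedged PyTorch sketch of the triplane lookup and a deformation MLP. The layer widths, feature dimensions, and class names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneField(nn.Module):
    """Sample features from XY, YZ, XZ planes at Gaussian centers and fuse them."""
    def __init__(self, res=128, feat_dim=32):
        super().__init__()
        self.planes = nn.Parameter(torch.randn(3, feat_dim, res, res) * 0.01)

    def forward(self, mu):                                        # mu: (N, 3) in [-1, 1]
        coords = [mu[:, [0, 1]], mu[:, [1, 2]], mu[:, [0, 2]]]    # XY, YZ, XZ
        feats = []
        for plane, uv in zip(self.planes, coords):
            grid = uv.view(1, -1, 1, 2)                           # (1, N, 1, 2)
            f = F.grid_sample(plane.unsqueeze(0), grid, align_corners=True)
            feats.append(f.squeeze(0).squeeze(-1).t())            # (N, feat_dim)
        return torch.cat(feats, dim=-1)                           # fused f_[mu]

class DeformMLP(nn.Module):
    """Predict {d_mu, d_r, d_s, d_alpha, d_SH} from fused features,
    lip features f_l, expression features f_e, and pose (R, T)."""
    def __init__(self, feat_dim=96, lip_dim=32, exp_dim=7, sh_dim=16, hidden=128):
        super().__init__()
        in_dim = feat_dim + lip_dim + exp_dim + 9 + 3             # + flattened R, T
        out_dim = 3 + 4 + 3 + 1 + sh_dim                          # d_mu, d_r, d_s, d_alpha, d_SH
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, f_mu, f_l, f_e, R, T):
        n = f_mu.shape[0]
        pose = torch.cat([R.flatten(), T]).expand(n, 12)
        x = torch.cat([f_mu, f_l.expand(n, -1), f_e.expand(n, -1), pose], dim=-1)
        return self.net(x)

# Toy usage: 1000 canonical Gaussian centers deformed by one audio frame.
mu_c = torch.rand(1000, 3) * 2 - 1
offsets = DeformMLP()(TriplaneField()(mu_c), torch.zeros(32), torch.zeros(7),
                      torch.eye(3), torch.zeros(3))
mu_deformed = mu_c + offsets[:, :3]                               # mu_c + d_mu, as in G_deform
```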

2. Face-Sync Controller

The Face-Sync Controller is a dual-component synchronization mechanism aligning lip movements precisely with the driving speech:

a. Audio-Visual Encoder:

Unlike ASR-based encoders, SyncTalk++ employs an audio encoder specifically trained for audio-visual alignment on the LRS2 dataset. Given a continuous face window $F$ and the corresponding audio $A$, synchronization is supervised with:

  • Cosine similarity: $\mathrm{sim}(F,A) = \dfrac{F \cdot A}{\|F\|_2 \, \|A\|_2}$
  • Binary cross-entropy loss:

$$L_{sync} = -\left[\, y \log\big(\mathrm{sim}(F,A)\big) + (1-y)\log\big(1-\mathrm{sim}(F,A)\big) \,\right]$$

A parallel $L_1$ reconstruction loss is imposed on $F$ decoded from the concatenated $\mathrm{Conv}(A) \oplus \mathrm{Conv}(F)$ features, promoting robust cross-modal correspondence. After pre-training, only the audio feature extractor is retained for lip-feature extraction at runtime.
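
A minimal PyTorch sketch of this synchronization supervision, assuming pre-computed face-window embeddings $F$ and audio embeddings $A$ from the encoder; the clamping constant is an implementation assumption that keeps the log terms finite.

```python
import torch
import torch.nn.functional as F

def sync_loss(face_emb, audio_emb, labels, eps=1e-7):
    """Cosine-similarity + binary cross-entropy sync loss.

    face_emb, audio_emb: (B, D) embeddings from the audio-visual encoder.
    labels: (B,) with 1 for in-sync pairs and 0 for off-sync (shifted) pairs.
    """
    sim = F.cosine_similarity(face_emb, audio_emb, dim=-1)   # F.A / (|F| |A|)
    sim = sim.clamp(eps, 1.0 - eps)                          # keep log() finite
    return F.binary_cross_entropy(sim, labels)

# Toy usage: positive (aligned) and negative (shifted-audio) pairs.
face = torch.randn(8, 512)
audio = torch.randn(8, 512)
labels = torch.tensor([1., 1., 1., 1., 0., 0., 0., 0.])
print(sync_loss(face, audio, labels))
```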

b. Facial Animation Capturer:

Facial expressions are parameterized with a 3D blendshape model using 52 coefficients $B$. For efficiency, seven semantically meaningful coefficients (e.g., brow, blink) form a reduced representation $E_{core} = \sum_j w_j B_j$. Facial features are separated into lips and expressions through a masked-attention mechanism, effectively disentangling $f_l$ (lips) from $f_e$ (expression) and reducing mutual interference during synthesis.
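
The reduction amounts to a weighted selection of a handful of blendshape channels; the sketch below uses hypothetical channel indices and weights (not the paper's) to illustrate $E_{core} = \sum_j w_j B_j$.

```python
import torch

# Full blendshape vector B: 52 coefficients for one frame (ARKit-style convention assumed).
B = torch.rand(52)

# Hypothetical indices and weights for the seven semantically meaningful channels
# (e.g., brow, blink); the actual selection and weights are not specified here.
core_idx = torch.tensor([0, 1, 2, 8, 9, 17, 19])
w = torch.tensor([1.0, 1.0, 1.0, 1.5, 1.5, 1.0, 0.5])

# Reduced representation E_core = sum_j w_j * B_j over the selected channels;
# the weighted 7-dim vector itself can serve as the expression feature f_e.
f_e = w * B[core_idx]
E_core = f_e.sum()
print(E_core, f_e)
```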

3. Head-Sync Stabilizer

The Head-Sync Stabilizer ensures stable, natural head movement by fusing multiple pose estimation and refinement techniques:

  • Head Motion Tracker:

A 3D Morphable Model (3DMM) fits 2D landmarks $L_{2D}$ to 3D keypoints using an optimal focal length $f_{opt}$, minimizing the projection error:

$$f_{opt} = \arg\min_{f_i} E_i\big(L_{2D}, L_{3D}(f_i, R_i, T_i)\big)$$

with further refinement of the head pose $(R, T)$.

  • Stable Keypoint Tracking and Bundle Adjustment:

Optical flow $\mathcal{F}(x_f, y_f, t_f) = (u_f, v_f)$ captures facial motion, and a Laplacian filter selects significant keypoints ($\mathcal{L}(\mathcal{F}(k)) > \theta$). A semantic weighting module de-emphasizes unstable, expression-driven regions (e.g., eyes, eyebrows). A two-stage optimization first aligns the projected 3D points $P_j$ with the tracked keypoints $K_j''$:

$$L_{init} = \sum_j \|P_j - K_j''\|_2$$

and then refines all parameters:

$$L_{sec} = \sum_j \|P_j(R, T) - K_j''\|_2$$

This results in temporally stable and physically plausible head motion across frames.
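
A hedged PyTorch sketch of this pipeline: a focal-length grid search followed by gradient-based refinement of the pose against the tracked keypoints. The pinhole projection helper, the axis-angle pose update, and the toy data are assumptions; the paper's optimizer and parameterization may differ.

```python
import torch

def skew(v):
    """Skew-symmetric matrix [v]_x so that exp([v]_x) is a rotation."""
    z = torch.zeros((), dtype=v.dtype)
    return torch.stack([torch.stack([z, -v[2], v[1]]),
                        torch.stack([v[2], z, -v[0]]),
                        torch.stack([-v[1], v[0], z])])

def project(points_3d, R, T, f):
    """Pinhole projection of 3D keypoints under pose (R, T) with focal length f."""
    cam = points_3d @ R.T + T
    return f * cam[:, :2] / cam[:, 2:3]

def fit_focal(points_3d, landmarks_2d, R, T, candidates):
    """f_opt = argmin_f E(L_2D, L_3D(f, R, T)) via a simple grid search."""
    errs = [(f, (project(points_3d, R, T, f) - landmarks_2d).norm(dim=1).mean())
            for f in candidates]
    return min(errs, key=lambda e: e[1])[0]

def refine_pose(points_3d, keypoints_2d, R_init, T_init, f, weights, steps=300):
    """Minimize sum_j w_j ||P_j(R, T) - K_j''||_2; weights down-weight unstable
    regions such as eyes and eyebrows."""
    rvec = torch.zeros(3, requires_grad=True)                 # axis-angle pose update
    T = T_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([rvec, T], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        R = torch.matrix_exp(skew(rvec)) @ R_init             # differentiable rotation
        residual = (project(points_3d, R, T, f) - keypoints_2d).norm(dim=1)
        loss = (weights * residual).sum()
        loss.backward()
        opt.step()
    return torch.matrix_exp(skew(rvec.detach())) @ R_init, T.detach()

# Toy usage with noisy keypoints; in practice these come from the 3DMM fit and
# the Laplacian-filtered optical-flow tracks.
pts = torch.randn(68, 3) + torch.tensor([0., 0., 5.])
R0, T0 = torch.eye(3), torch.zeros(3)
kps = project(pts, R0, T0, 1000.0) + 0.5 * torch.randn(68, 2)
f_opt = fit_focal(pts, kps, R0, T0, [800.0, 1000.0, 1200.0])
R_ref, T_ref = refine_pose(pts, kps, R0, T0, f_opt, weights=torch.ones(68))
```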

4. Robustness to Out-of-Distribution Audio

SyncTalk++ incorporates two targeted modules to address the challenges posed by OOD audio (where input speech differs in identity, affect, or prosody from training data):

a. OOD Audio Expression Generator:

An extension of EmoTalk, this module uses a Transformer-based VQ-VAE. The encoder $E$ transforms blendshape coefficients $B$ into latent codes $\mathcal{Z}_e$, which are quantized with a codebook $Z$:

$$Z_q = Q(\mathcal{Z}_e) = \arg\min_{z_k \in Z} \|\mathcal{Z}_e - z_k\|_2$$

Reconstruction through the decoder $D$ gives $\tilde{B} = D(Z_q)$. The objective comprises reconstruction and commitment losses:

$$\mathcal{L} = \|B - \tilde{B}\|^2 + \|\mathcal{Z}_e - \mathrm{sg}(Z_q)\|^2 + \beta \, \|\mathrm{sg}(\mathcal{Z}_e) - Z_q\|^2$$

This facilitates appropriately expressive faces in OOD scenarios.
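
A minimal PyTorch sketch of the vector quantization and the loss above; simple MLPs stand in for the Transformer encoder and decoder, and the codebook size, latent dimension, and $\beta$ value are assumptions.

```python
import torch
import torch.nn as nn

class BlendshapeVQVAE(nn.Module):
    """Toy VQ-VAE over blendshape coefficients B (52-dim per frame).
    MLPs stand in for the Transformer encoder/decoder used in the paper."""
    def __init__(self, dim=52, latent=64, codes=256, beta=0.25):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, dim))
        self.codebook = nn.Parameter(torch.randn(codes, latent))
        self.beta = beta

    def forward(self, B):
        z_e = self.enc(B)                                     # Z_e = E(B)
        dists = torch.cdist(z_e, self.codebook)               # ||Z_e - z_k||
        z_q = self.codebook[dists.argmin(dim=-1)]             # nearest codes Z_q
        z_st = z_e + (z_q - z_e).detach()                     # straight-through estimator
        B_hat = self.dec(z_st)                                # B~ = D(Z_q)
        # Loss terms exactly as written above: reconstruction, commitment, codebook.
        loss = ((B - B_hat) ** 2).mean() \
             + ((z_e - z_q.detach()) ** 2).mean() \
             + self.beta * ((z_e.detach() - z_q) ** 2).mean()
        return B_hat, loss

# Toy usage: a batch of 16 blendshape frames.
model = BlendshapeVQVAE()
B = torch.rand(16, 52)
B_hat, loss = model(B)
loss.backward()
```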

b. OOD Audio Torso Restorer:

A lightweight U-Net inpainting network repairs head–torso boundary inconsistencies—common when jaw or neck pose shifts. Random mask expansion simulates missing regions during training; the model inpaints the missing torso as:

$$\hat{F}_{source} = \mathcal{I}\big(M F_{source},\; (1 - M - \delta_{rand})\, F_{source};\; \theta\big)$$

using $L_1$ and LPIPS losses. At inference, a fixed boundary region (e.g., 15 pixels) is inpainted to ensure seamless blending.
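
The following sketch illustrates the training-time mask handling implied by the formula above: the head mask $M$ is randomly dilated to create the missing band $\delta_{rand}$, and a small convolutional stand-in (not the actual U-Net) predicts the frame from the head part and the hole-punched torso part. The mask source, shapes, and network are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def expand_mask(mask, max_pixels=15):
    """Randomly dilate the head mask M by up to `max_pixels` to create the
    missing boundary band delta_rand used during training."""
    k = int(torch.randint(1, max_pixels + 1, ()).item()) * 2 + 1
    return (F.max_pool2d(mask, kernel_size=k, stride=1, padding=k // 2) > 0).float()

class TinyInpainter(nn.Module):
    """Stand-in for the lightweight U-Net I(. ; theta); takes the head part and
    the hole-punched torso part and predicts the full frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 3, 3, padding=1))

    def forward(self, head_part, torso_part):
        return self.net(torch.cat([head_part, torso_part], dim=1))

# Training-style usage: F_source is a frame, M a head mask from face parsing (assumed).
frame = torch.rand(1, 3, 256, 256)
M = torch.zeros(1, 1, 256, 256)
M[:, :, 40:160, 80:180] = 1.0
delta = expand_mask(M) - M                              # delta_rand: random seam band
head_part = frame * M                                   # M * F_source
torso_part = frame * (1.0 - M - delta)                  # (1 - M - delta_rand) * F_source
pred = TinyInpainter()(head_part, torso_part)           # F_hat = I(...; theta)
loss = F.l1_loss(pred, frame)                           # L1 term (LPIPS omitted here)
```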

5. Quantitative Performance and Practical Evaluation

SyncTalk++ demonstrates simultaneous advances in both quality and speed:

  • Rendering: Real-time synthesis at up to 101 FPS ($512 \times 512$), substantially faster than prior 3D approaches.
  • Image Quality: Significant improvements in PSNR, LPIPS, MS-SSIM, and FID compared to both 2D and prior 3D/NeRF methods.
  • Lip Synchronization: State-of-the-art landmark distance (LMD) and Lip Sync Error Confidence (LSE-C) scores on both in-distribution and OOD audio.
  • Training Efficiency: Fast subject adaptation; new subject training requires only about 1.5 hours.
  • User Studies: In comprehensive perceptual studies with more than 40 participants, SyncTalk++ was consistently top-rated for lip-sync accuracy, expression-sync, pose-sync, image quality, and overall video realism.
  • Ablation Experiments: Each architecture module (audio-visual encoder, 3D blendshapes, head stabilizer, OOD modules) was shown essential for optimal performance, with observed quality and synchronization degradation upon removal.

6. Application Domains and Significance

SyncTalk++'s combination of lip-audio alignment, blendshaped facial expressions, head pose stability, high fidelity, and real-time performance addresses critical requirements across:

  • Digital avatars and virtual assistants requiring realism and interactive responsiveness
  • Film/animation production seeking reduced post-production effort in dubbing or lip-sync
  • Real-time telepresence and video communication with enhanced expression and synchronization
  • Robust media synthesis in settings with unknown or cross-domain audio sources

Its generalization to OOD input broadens applicability, while the explicit, modular design supports future integration of improved speech feature representation, 3D expression modeling, and further rendering optimizations.


SyncTalk++ constitutes a system-level advance that merges explicit 3D dynamic rendering and specialized synchronization mechanisms, setting new standards in efficient and synchronized talking head video synthesis (2506.14742).
