
Talking Head Generation

Updated 1 February 2026
  • Talking head generation is the computational synthesis of natural-looking talking videos from a single portrait image and driving speech, creating temporally consistent 3D head avatars.
  • It leverages explicit Gaussian splatting to disentangle static head geometry from dynamic lip motions, enabling precise and controllable animation without 3D scans or facial landmarks.
  • Recent methods like Splat-Portrait achieve superior performance using audio-driven animation combined with per-splat control, as demonstrated by improved PSNR, SSIM, and lip-sync metrics.

Talking head generation is the computational synthesis of natural-looking talking videos from a single portrait image and driving speech, producing temporally consistent 3D head avatars with synchronized lip motion and plausible view-dependent appearance. Recent advances leverage explicit Gaussian splatting techniques to address challenges of 3D head reconstruction, disentanglement of facial motions, and requirements for minimal data and supervision. Methods such as Splat-Portrait (Shi et al., 26 Jan 2026) achieve high-quality animated avatars by directly controlling per-splat attributes, circumventing prior dependencies on neural implicit fields, facial landmarks, or multi-view datasets.

1. Problem Definition and Motivation

Talking head generation targets the accurate synthesis of facial movements and lip synchronization from speech, using one static portrait image as input. The critical objectives are: (1) reconstructing a high-fidelity, 3D-consistent mesh for the head that generalizes to novel poses; (2) animating the facial region—including accurate lip motion—in real time and in sync with spoken audio; (3) achieving these goals with no domain heuristics (such as blend-shapes or facial landmarks), and without 3D scans or multi-view setups. Prior works relying on NeRFs or 3DMMs entangle geometry and dynamics in implicit fields, leading to artifacts and view-inconsistent animation, while warping and landmark-driven models limit expressiveness and cross-identity generalization. Splat-Portrait innovates by adopting 3D Gaussian splatting, yielding both explicit controllability and separation of static geometry from dynamic facial motion (Shi et al., 26 Jan 2026).

2. Gaussian Splatting Representation

A central component is the representation of the human head as a set of anisotropic 3D Gaussian splats, each parameterized by center $\boldsymbol\mu_i \in \mathbb{R}^3$, covariance $\Sigma_i \in \mathbb{R}^{3\times3}$, opacity $\alpha_i$, and RGB color $\mathbf{c}_i$. Each splat contributes a spatial density

$$\rho_i(\mathbf{x}) = \alpha_i \exp\Bigl(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol\mu_i)^\top \Sigma_i^{-1} (\mathbf{x} - \boldsymbol\mu_i)\Bigr)$$

with overall field

$$\rho(\mathbf{x}) = \sum_i \rho_i(\mathbf{x}), \qquad \mathbf{c}(\mathbf{x}) = \frac{\sum_i \rho_i(\mathbf{x})\, \mathbf{c}_i}{\rho(\mathbf{x})}$$

Rendering is performed by differentiable splat-based rasterization $\mathcal{R}$, which projects each Gaussian's density into screen space and composites via ordered alpha blending (Shi et al., 26 Jan 2026). This explicit decomposition enables real-time rendering and post-hoc editing of facial attributes. Splat-Portrait further separates the head splats from a whole-image inpainted 2D background, allowing for occlusion handling and authentic background reconstruction as the viewpoint changes.
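The following is a minimal NumPy sketch of the density and color fields defined above, evaluated at a single query point. It is illustrative only: the toy splat values are assumptions, and it omits the screen-space projection and ordered alpha blending performed by the actual differentiable rasterizer.

```python
# Minimal sketch of the Gaussian splat density and color fields defined above.
# Toy data and shapes are illustrative assumptions, not the Splat-Portrait
# implementation (which uses a differentiable screen-space rasterizer).
import numpy as np

def splat_density(x, mu, cov, alpha):
    """rho_i(x) = alpha_i * exp(-0.5 (x - mu_i)^T Sigma_i^{-1} (x - mu_i))."""
    d = x - mu                                               # (3,)
    return alpha * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def field(x, mus, covs, alphas, colors, eps=1e-8):
    """Aggregate density rho(x) and density-weighted color c(x)."""
    rhos = np.array([splat_density(x, m, S, a)
                     for m, S, a in zip(mus, covs, alphas)])  # (N,)
    rho = rhos.sum()
    c = (rhos[:, None] * colors).sum(axis=0) / (rho + eps)    # (3,)
    return rho, c

# Toy example: two isotropic splats with different opacities and colors.
mus    = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
covs   = np.stack([np.eye(3) * 0.05, np.eye(3) * 0.05])
alphas = np.array([0.9, 0.6])
colors = np.array([[1.0, 0.2, 0.2], [0.2, 0.2, 1.0]])         # RGB per splat

rho, c = field(np.array([0.25, 0.0, 0.0]), mus, covs, alphas, colors)
print(rho, c)
```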

3. Model Architecture and Audio-Driven Animation

The model consists of two principal networks:

  • Static Generator (SG): A U-Net backbone that predicts, per-pixel, Gaussian splat attributes $\{\alpha, o, s, d, \Delta_s, r, \mathbf{c}\}$ and the 2D background $B$, given an input image $I_i$. These parameters yield the static 3D configuration.
  • AudioNet/AudioAttNet: Extract frame-level speech features $A_n$ via Wav2Vec2-XLSR-53 and process them through a 1D-convolution stack, followed by attention pooling into a $256$-D audio embedding.
  • Dynamic Decoder: Conditioned on static SG features and fused $[\text{audio}, \Delta T]$ temporal embeddings via FiLM, the dynamic decoder outputs per-pixel dynamic offsets $\Delta_d \in \mathbb{R}^3$ to the splat positions. Lip motions at time $T_n$ are synthesized by updating splat positions, $\boldsymbol\mu_i(T_n) = \boldsymbol\mu_i^{\text{static}} + \Delta_{d,i}(T_n)$, while other attributes remain static (Shi et al., 26 Jan 2026).

Temporal coordination utilizes either sinusoidal positional encoding or a Fourier map to embed time deltas. This architecture enables framewise lip synchronization directly from audio without motion-driven priors or landmarks.
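A hedged PyTorch sketch of this dynamic path is shown below. The 1024-D frame features, 8-band Fourier time embedding, layer widths, and module names are all assumptions for illustration; only the overall flow, attention-pooled audio embedding, FiLM conditioning, and per-pixel offsets added to static splat centers, follows the description above.

```python
# Illustrative sketch of the audio-driven dynamic path: frame-level speech
# features are pooled into a 256-D embedding, fused with a Fourier time
# embedding, and mapped via FiLM to per-pixel position offsets Delta_d.
# Layer sizes and feature resolution are assumptions, not the paper's values.
import torch
import torch.nn as nn

def fourier_time_embedding(dt, num_bands=8):
    """Embed a scalar time delta with sin/cos features (one of the two
    temporal encodings mentioned in the text)."""
    freqs = 2.0 ** torch.arange(num_bands, dtype=dt.dtype, device=dt.device)
    angles = dt[:, None] * freqs[None, :]                   # (B, num_bands)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)  # (B, 2*num_bands)

class AudioPool(nn.Module):
    """1D-conv stack over per-frame speech features, then attention pooling."""
    def __init__(self, in_dim=1024, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, out_dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(out_dim, out_dim, 3, padding=1), nn.ReLU())
        self.attn = nn.Linear(out_dim, 1)

    def forward(self, feats):                                 # (B, T, in_dim)
        h = self.conv(feats.transpose(1, 2)).transpose(1, 2)  # (B, T, out_dim)
        w = self.attn(h).softmax(dim=1)                       # (B, T, 1)
        return (w * h).sum(dim=1)                             # (B, out_dim)

class DynamicDecoder(nn.Module):
    """FiLM-modulated head predicting per-pixel 3D offsets Delta_d."""
    def __init__(self, feat_ch=64, cond_dim=256 + 16):
        super().__init__()
        self.film = nn.Linear(cond_dim, 2 * feat_ch)  # scale and shift
        self.head = nn.Conv2d(feat_ch, 3, 1)          # Delta_d in R^3

    def forward(self, static_feats, audio_emb, time_emb):
        gamma, beta = self.film(torch.cat([audio_emb, time_emb], -1)).chunk(2, -1)
        h = static_feats * gamma[:, :, None, None] + beta[:, :, None, None]
        return self.head(torch.relu(h))               # (B, 3, H, W)

# Splat positions at time T_n: mu(T_n) = mu_static + Delta_d(T_n).
B, H, W = 1, 64, 64
offsets = DynamicDecoder()(torch.randn(B, 64, H, W),
                           AudioPool()(torch.randn(B, 20, 1024)),
                           fourier_time_embedding(torch.tensor([0.04])))
mu_dynamic = torch.randn(B, 3, H, W) + offsets        # per-pixel splat centers
```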

4. Training Objectives and Data Protocols

Training is accomplished via purely image-space losses—no 3D scan, blend-shape, or landmark supervision is employed. The major losses are:

  • Static Reconstruction Loss:

$$\mathcal{L}_{\text{static}} = \|I_i - I_i^*\|_2 + \|I_n - I_n^*\|_2 + \lambda \left(\text{LPIPS}(I_i, I_i^*) + \text{LPIPS}(I_n, I_n^*)\right)$$

where $\lambda = 0.01$ and $I_i^*, I_n^*$ are renders under static splat parameters.

  • Dynamic Reconstruction Loss:

$$\mathcal{L}_{\text{dynamic}} = \|I_i - I_i^*\|_2 + \|I_n - I_n^{**}\|_2 + \lambda \left(\text{LPIPS}(I_i, I_i^*) + \text{LPIPS}(I_n, I_n^{**})\right)$$

with $I_n^{**}$ rendered using dynamic (audio-driven) offsets.

  • Score-Distillation Loss (SDS): Applied at extreme camera poses ($\pm45^\circ$ yaw, $\pm12.5^\circ$ pitch), where a clean render $x_{\text{clean}}$ and its diffusion-denoised counterpart $x_{\text{denoised}}$ are compared, penalizing

$$\mathcal{L}_{\text{SDS}} = \|x_{\text{clean}} - x_{\text{denoised}}\|_2$$

Total loss combines reconstruction and SDS terms for static pretraining and dynamic fine-tuning stages (Shi et al., 26 Jan 2026).
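A compact sketch of these objectives is given below, under the assumption that LPIPS is supplied by an external perceptual network (e.g., the lpips package) and with the diffusion denoiser left abstract; the use of MSE for the L2 terms is a simplification.

```python
# Illustrative sketch of the image-space training objectives above. lpips_fn
# is assumed to be a perceptual network (e.g., lpips.LPIPS(net='vgg')); the
# SDS term is shown only as the final penalty between the clean render and
# its diffusion-denoised counterpart, with the denoiser itself left abstract.
import torch
import torch.nn.functional as F

LAMBDA = 0.01  # perceptual-loss weight from the text

def recon_loss(I_i, I_i_hat, I_n, I_n_hat, lpips_fn):
    """Shared form of the static and dynamic reconstruction losses; for the
    dynamic stage, I_n_hat is the render with audio-driven offsets."""
    l2 = F.mse_loss(I_i_hat, I_i) + F.mse_loss(I_n_hat, I_n)
    lp = lpips_fn(I_i_hat, I_i).mean() + lpips_fn(I_n_hat, I_n).mean()
    return l2 + LAMBDA * lp

def sds_loss(x_clean, x_denoised):
    """Score-distillation term applied at extreme camera poses."""
    return F.mse_loss(x_clean, x_denoised)

# Example combination (weight w_sds is a placeholder assumption):
# total = recon_loss(I_i, I_i_hat, I_n, I_n_hat, lpips_fn) + w_sds * sds_loss(xc, xd)
```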

Datasets include HDTF ($\approx$400 clips, 350 subjects) and TalkingHead-1KH (1100 videos, 300–10,000 frames each) at $256\times256$ resolution, with audio at 16 kHz and camera parameters estimated via 3DMM fitting. Training is conducted for 200 epochs (static) and 50 epochs (dynamic) using AdamW.
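A hypothetical two-stage loop consistent with this protocol is sketched below; the learning rate, batching, and dataloader are placeholder assumptions, and the loss functions are those sketched in the previous section.

```python
# Hypothetical training schedule: static pretraining followed by dynamic
# fine-tuning, both with AdamW. Hyperparameters are placeholder assumptions.
import torch

def train_stage(model, loader, loss_fn, epochs, lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()

# Stage 1: static pretraining (200 epochs); Stage 2: dynamic fine-tuning (50).
# train_stage(model, train_loader, static_loss, epochs=200)
# train_stage(model, train_loader, dynamic_loss, epochs=50)
```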

5. Experimental Evaluation and Comparative Analysis

Empirical results demonstrate the superiority of Splat-Portrait on standard metrics for HDTF and TH-1KH:

  • Same-identity test splits:
    • PSNR: $23.87$ (vs next best $23.08$)
    • SSIM: $0.814$ (vs $0.786$)
    • LPIPS: $0.128$ (vs $0.182$)
    • CSIM (identity): $0.811$ (vs $0.753$)
    • FID: $25.58$ (vs $28.63$–$37.89$)
    • LipSync error: $6.328$ (vs $6.681$ for Real3D-Portrait) (Shi et al., 26 Jan 2026)
  • Cross-identity:
    • FID: $28.62$ (best among competitors)
    • CSIM: $0.726$ (best among competitors)

Qualitative results show sharper edge details, smoother depth reconstruction, plausible backgrounds revealed under head rotation, and consistent lip sync to arbitrary speech. Splat-Portrait achieves disentangled static and dynamic control with no facial landmarks or 3D scan supervision.
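For reference, the per-frame image metrics (PSNR and SSIM) can be computed as in the sketch below using scikit-image; CSIM, FID, and the lip-sync score require dedicated identity, Inception, and SyncNet-style models and are omitted here. The random frames are stand-ins for real prediction/ground-truth pairs.

```python
# Minimal sketch of the per-frame image metrics reported above (PSNR, SSIM).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(pred, gt):
    """pred, gt: uint8 RGB frames of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, data_range=255, channel_axis=-1)
    return psnr, ssim

# Example on random data (real evaluation averages over all test-set frames).
gt   = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
pred = np.clip(gt + np.random.randint(-5, 6, gt.shape), 0, 255).astype(np.uint8)
print(frame_metrics(pred, gt))
```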

6. Technical Innovations and Limitations

Innovations include:

  • Explicit control: Per-splat attributes enable direct manipulation for animation.
  • Separation of static and dynamic geometry: Head structure remains stable while dynamic offsets synthesize lip motion.
  • Absence of domain heuristics: No blend-shapes, warping priors, or landmark supervision.
  • Background handling: Inpainted, view-dependent 2D backgrounds eliminate “floating head” artifacts.

Limitations include mild blurring at extreme poses (beyond $\pm45^\circ$ yaw), a dynamic decoder that currently articulates only lip motion rather than full facial expression, and static backgrounds that do not encode camera or scene motion. A plausible implication is that full 4D geometry reconstruction and explicit emotion tokens could extend the framework.

7. Relation to Adjacent Methodologies and Future Directions

Relative to HumanSplat (Pan et al., 2024), which predicts Gaussian splatting representations for entire human bodies from single images using multi-view diffusion priors and latent Transformers, Splat-Portrait specializes in talking head synthesis and augments the explicit splatting approach with audio-driven animation. HumanSplat introduces a structure-aware Transformer fusing multi-view latents with SMPL mesh priors, and hierarchical semantic losses, targeting photorealistic full-body reconstruction; Splat-Portrait generalizes these principles for head animation, dispensing with geometric priors in favor of data-driven disentanglement.

Future directions suggested include:

  • Extending dynamic decoders for non-rigid motion in cheeks and brows
  • Explicit emotion conditioning
  • Video-driven or multi-view lifted diffusion priors
  • Enhanced background modeling for dynamic scenes

These advances would further expand the applicability of Gaussian splatting approaches in talking head generation and general human avatar synthesis.
