Talking Head Generation
- Talking head generation is the computational synthesis of natural-looking talking videos from a single portrait image and driving speech, creating temporally consistent 3D head avatars.
- It leverages explicit Gaussian splatting to disentangle static head geometry from dynamic lip motion, enabling precise, controllable animation without 3D scans or facial landmarks.
- Recent methods like Splat-Portrait achieve superior performance using audio-driven animation combined with per-splat control, as demonstrated by improved PSNR, SSIM, and lip-sync metrics.
Talking head generation is the computational synthesis of natural-looking talking videos from a single portrait image and driving speech, producing temporally consistent 3D head avatars with synchronized lip motion and plausible view-dependent appearance. Recent advances leverage explicit Gaussian splatting techniques to address challenges of 3D head reconstruction, disentanglement of facial motions, and requirements for minimal data and supervision. Methods such as Splat-Portrait (Shi et al., 26 Jan 2026) achieve high-quality animated avatars by directly controlling per-splat attributes, circumventing prior dependencies on neural implicit fields, facial landmarks, or multi-view datasets.
1. Problem Definition and Motivation
Talking head generation targets the accurate synthesis of facial movements and lip synchronization from speech, using one static portrait image as input. The critical objectives are: (1) reconstructing a high-fidelity, 3D-consistent head representation that generalizes to novel poses; (2) animating the facial region, including accurate lip motion, in real time and in sync with the spoken audio; (3) achieving these goals without domain heuristics (such as blend-shapes or facial landmarks) and without 3D scans or multi-view setups. Prior works relying on NeRFs or 3DMMs entangle geometry and dynamics in implicit fields, leading to artifacts and view-inconsistent animation, while warping- and landmark-driven models limit expressiveness and cross-identity generalization. Splat-Portrait instead adopts 3D Gaussian splatting, yielding both explicit controllability and a separation of static geometry from dynamic facial motion (Shi et al., 26 Jan 2026).
2. Gaussian Splatting Representation
A central component is the representation of the human head as a set of anisotropic 3D Gaussian splats, each parameterized by a center $\mu_i$, covariance $\Sigma_i$, opacity $\alpha_i$, and RGB color $c_i$. Each splat contributes a spatial density

$$G_i(x) = \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^\top \Sigma_i^{-1} (x-\mu_i)\right),$$

with overall field

$$G(x) = \sum_i \alpha_i\, G_i(x).$$

Rendering is performed by differentiable splat-based rasterization, projecting each Gaussian's density into screen space and compositing via ordered alpha blending (Shi et al., 26 Jan 2026). This explicit decomposition enables real-time rendering and post-hoc editing of facial attributes. Splat-Portrait further separates head splats from a whole-image inpainted 2D background, allowing for occlusion handling and authentic background reconstruction as the view changes.
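To make the rendering equation concrete, the following is a minimal NumPy sketch of per-splat density evaluation and ordered alpha compositing for a single sample point. It is illustrative only: the function names, splat values, and naive depth-sorted loop are assumptions, and practical systems (including Splat-Portrait) use a tile-based differentiable rasterizer that projects each 3D Gaussian to a 2D screen-space Gaussian.

```python
import numpy as np

def splat_density(x, mu, sigma):
    """Unnormalized anisotropic Gaussian density G_i(x) of one splat."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d)

def composite_point(x, mus, sigmas, alphas, colors, depths):
    """Ordered alpha blending of all splats evaluated at point x.

    mus: (N, 3) centers; sigmas: (N, 3, 3) covariances;
    alphas: (N,) opacities; colors: (N, 3) RGB; depths: (N,) sort keys.
    """
    order = np.argsort(depths)            # near-to-far ordering
    out, transmittance = np.zeros(3), 1.0
    for i in order:
        a = alphas[i] * splat_density(x, mus[i], sigmas[i])   # effective alpha
        out += transmittance * a * colors[i]
        transmittance *= 1.0 - a
        if transmittance < 1e-4:          # early termination once nearly opaque
            break
    return out

# toy usage: two splats contributing to one sample point at the origin
mus = np.array([[0.0, 0.0, 0.5], [0.1, 0.0, 1.0]])
sigmas = np.stack([np.eye(3) * 0.1, np.eye(3) * 0.2])
alphas = np.array([0.8, 0.6])
colors = np.array([[1.0, 0.2, 0.2], [0.2, 0.2, 1.0]])
print(composite_point(np.zeros(3), mus, sigmas, alphas, colors, depths=mus[:, 2]))
```

In Splat-Portrait, the composited head is additionally blended over the inpainted 2D background, which fills the regions revealed when the head rotates.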
3. Model Architecture and Audio-Driven Animation
The model consists of the following principal components:
- Static Generator (SG): A U-Net backbone that predicts, per-pixel, the Gaussian splat attributes and the inpainted 2D background, given a single input portrait image. These parameters yield the static 3D configuration.
- AudioNet/AudioAttNet: Extracts frame-level speech features via a Wav2Vec2 XLSR-53 encoder and processes them through a 1D-convolution stack, followed by attention pooling into a $256$-D audio embedding.
- Dynamic Decoder: Conditioned on static SG features and fused temporal embeddings via FiLM, the dynamic decoder outputs per-pixel dynamic offsets $\Delta\mu_i(t)$ to the splat positions. Lip motion at time $t$ is synthesized by updating the splat positions as $\mu_i(t) = \mu_i + \Delta\mu_i(t)$, while all other attributes remain static (Shi et al., 26 Jan 2026).
Temporal coordination utilizes either sinusoidal positional encoding or a Fourier map to embed time deltas. This architecture enables framewise lip synchronization directly from audio without motion-driven priors or landmarks.
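A minimal PyTorch sketch of this conditioning path is given below, under assumed tensor shapes and module names (`AudioAttPool`, `DynamicHead`, and all layer sizes other than the 256-D audio embedding are illustrative placeholders rather than the paper's architecture): frame-level speech features are attention-pooled into a 256-D embedding, fused with a sinusoidal time embedding, and injected via FiLM to predict per-pixel position offsets.

```python
import math
import torch
import torch.nn as nn

def time_embedding(t, dim=64):
    """Sinusoidal encoding of a scalar time value, (B,) -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(torch.arange(half, dtype=torch.float32) * (-math.log(1e4) / half))
    ang = t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

class AudioAttPool(nn.Module):
    """1D-conv stack over speech features followed by attention pooling to 256-D."""
    def __init__(self, in_dim=1024, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
        )
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                                   # feats: (B, T, in_dim)
        h = self.conv(feats.transpose(1, 2)).transpose(1, 2)    # (B, T, dim)
        w = torch.softmax(self.score(h), dim=1)                 # attention over frames
        return (w * h).sum(dim=1)                               # (B, dim)

class DynamicHead(nn.Module):
    """FiLM-conditioned head that predicts per-pixel 3D position offsets."""
    def __init__(self, feat_ch=64, cond_dim=256 + 64):
        super().__init__()
        self.film = nn.Linear(cond_dim, 2 * feat_ch)            # per-channel scale and shift
        self.out = nn.Conv2d(feat_ch, 3, kernel_size=1)         # delta_mu for every pixel

    def forward(self, static_feats, audio_emb, t):              # static_feats: (B, C, H, W)
        cond = torch.cat([audio_emb, time_embedding(t)], dim=-1)
        scale, shift = self.film(cond).chunk(2, dim=-1)
        h = static_feats * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return self.out(h)                                      # (B, 3, H, W) offsets

# dummy usage: 2 clips, 50 speech frames, a 128x128 static feature map
pool, head = AudioAttPool(), DynamicHead()
audio_feats = torch.randn(2, 50, 1024)
static_feats = torch.randn(2, 64, 128, 128)
offsets = head(static_feats, pool(audio_feats), t=torch.tensor([0.0, 1.0]))
print(offsets.shape)                                            # torch.Size([2, 3, 128, 128])
```

The predicted offsets are added to the static splat centers at each frame while covariance, opacity, and color stay untouched, which is what keeps the head geometry stable under animation.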
4. Training Objectives and Data Protocols
Training is accomplished via purely image-space losses—no 3D scan, blend-shape, or landmark supervision is employed. The major losses are:
- Static Reconstruction Loss: an image-space loss between renders produced with the static splat parameters and the corresponding ground-truth frames.
- Dynamic Reconstruction Loss: the same image-space loss applied to renders that use the dynamic (audio-driven) splat offsets.
- Score-Distillation Loss (SDS): applied to renders from extreme camera poses (large yaw and pitch); the rendered image is noised and denoised via a pretrained diffusion prior, and the discrepancy between the render and the diffusion prediction is penalized.
Total loss combines reconstruction and SDS terms for static pretraining and dynamic fine-tuning stages (Shi et al., 26 Jan 2026).
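As a schematic of this combined objective, the sketch below assumes L1 image reconstruction and the common score-distillation implementation in which the gradient with respect to the render equals the detached noise residual; the loss weights and helper names are placeholders, not values or functions from the paper.

```python
import torch
import torch.nn.functional as F

def sds_term(render, eps_pred, eps):
    """Score-distillation surrogate: d(loss)/d(render) equals the detached residual."""
    grad = (eps_pred - eps).detach()
    return (render * grad).sum() / render.shape[0]

def total_loss(static_render, dynamic_render, target,
               extreme_render=None, eps_pred=None, eps=None,
               lambda_dyn=1.0, lambda_sds=0.1):
    """Static + dynamic image-space reconstruction, plus optional SDS at extreme poses."""
    loss = F.l1_loss(static_render, target)                         # static reconstruction
    loss = loss + lambda_dyn * F.l1_loss(dynamic_render, target)    # dynamic reconstruction
    if extreme_render is not None:                                  # SDS regularization
        loss = loss + lambda_sds * sds_term(extreme_render, eps_pred, eps)
    return loss

# dummy usage
B, H, W = 2, 128, 128
static, dynamic, gt = (torch.rand(B, 3, H, W) for _ in range(3))
print(total_loss(static, dynamic, gt).item())
```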
Datasets include HDTF (400 clips, 350 subjects) and TalkingHead-1KH (1,100 videos, 300–10,000 frames each), with frames at a fixed resolution, audio sampled at 16 kHz, and camera parameters estimated via 3DMM fitting. Training is conducted for 200 epochs (static stage) and 50 epochs (dynamic stage) using AdamW.
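For orientation, this protocol can be summarized in a configuration sketch; values not stated above (image resolution, learning rate, batch size) are left as placeholders rather than guessed.

```python
config = {
    "datasets": ["HDTF", "TalkingHead-1KH"],
    "audio_sample_rate_hz": 16_000,             # as stated above
    "camera_parameters": "per-frame 3DMM fitting",
    "optimizer": "AdamW",
    "stages": {
        "static_pretraining": {"epochs": 200, "losses": ["reconstruction", "SDS"]},
        "dynamic_finetuning": {"epochs": 50,  "losses": ["reconstruction", "SDS"]},
    },
    # placeholders -- not reported in the text above
    "image_resolution": None,
    "learning_rate": None,
    "batch_size": None,
}
```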
5. Experimental Evaluation and Comparative Analysis
Empirical results demonstrate the superiority of Splat-Portrait on standard metrics for HDTF and TH-1KH:
- Same-identity test splits:
- PSNR: $23.87$ (vs next best $23.08$)
- SSIM: $0.814$ (vs $0.786$)
- LPIPS: $0.128$ (vs $0.182$)
- CSIM (identity): $0.811$ (vs $0.753$)
- FID: $25.58$ (vs $28.63$–$37.89$)
- LipSync error: $6.328$ (vs $6.681$ for Real3D-Portrait) (Shi et al., 26 Jan 2026)
- Cross-identity:
- FID: $28.62$ (best among competitors)
- CSIM: $0.726$ (best among competitors)
Qualitative results show sharper edge details, smoother depth reconstruction, plausible backgrounds revealed under head rotation, and consistent lip sync to arbitrary speech. Splat-Portrait achieves disentangled static and dynamic control with no facial landmark or 3D scan supervision.
6. Technical Innovations and Limitations
Innovations include:
- Explicit control: Per-splat attributes enable direct manipulation for animation.
- Separation of static and dynamic geometry: Head structure remains stable while dynamic offsets synthesize lip motion.
- Absence of domain heuristics: No blend-shapes, warping priors, or landmark supervision.
- Background handling: Inpainted, view-dependent 2D backgrounds eliminate “floating head” artifacts.
Limitations are observed at extreme poses (large yaw angles induce mild blurring); the current dynamic decoder articulates only lip motion rather than full facial expression; and the static background does not model camera or scene motion. A plausible implication is that full 4D geometry reconstruction and explicit emotion tokens could extend the framework.
7. Relation to Adjacent Methodologies and Future Directions
Relative to HumanSplat (Pan et al., 2024), which predicts Gaussian splatting representations for entire human bodies from single images using multi-view diffusion priors and latent Transformers, Splat-Portrait specializes in talking head synthesis and augments the explicit splatting approach with audio-driven animation. HumanSplat introduces a structure-aware Transformer that fuses multi-view latents with SMPL mesh priors and uses hierarchical semantic losses to target photorealistic full-body reconstruction; Splat-Portrait adapts these principles to head animation, dispensing with geometric priors in favor of data-driven disentanglement.
Future directions suggested include:
- Extending dynamic decoders for non-rigid motion in cheeks and brows
- Explicit emotion conditioning
- Video-driven or multi-view lifted diffusion priors
- Enhanced background modeling for dynamic scenes
These advances would further expand the applicability of Gaussian splatting approaches in talking head generation and general human avatar synthesis.