
Splat-Portrait: Audio-Driven 3D Head Animation

Updated 1 February 2026
  • The paper introduces Splat-Portrait, which employs explicit anisotropic Gaussian splatting and dynamic decoding to achieve photorealistic 3D talking head synthesis from a single image.
  • It uses a two-stage training process, static pre-training followed by audio-conditioned fine-tuning (both regularized with score-distillation), to decouple static geometry from dynamic lip motion.
  • Quantitative and qualitative results demonstrate improved metrics, accurate lip synchronization, and artifact-free rendering across extreme head poses compared to previous methods.

Splat-Portrait is an audio-driven, single-image 3D talking head synthesis framework utilizing explicit anisotropic Gaussian splatting. It circumvents prior limitations of 3D morphable models and neural radiance field-based architectures by fusing dynamic geometry editing and self-supervised learning for photorealistic, identity-preserving, and 3D-consistent animation from monocular data, without requiring landmarks, 3D scans, or facial priors (Shi et al., 26 Jan 2026).

1. Foundations and Motivation

The problem of talking head generation (THG) from a single portrait and speech signal is fundamentally ill-posed. Conventional image-to-image translation methods excel only at 2D lip synthesis and lack multi-view consistency, while prior attempts using NeRF or 3DMM representations tightly entangle static geometric structure and facial dynamics, often leading to artifacts, poor extrapolation, and inconsistencies in unseen viewpoints. Most critically, these methods rely on either strong parametric priors (FLAME, PNCC) or multi-view data, resources rarely available for in-the-wild portrait videos. Splat-Portrait directly addresses these deficits by:

  • Introducing explicit, anisotropic Gaussian splats as the geometric primitive, separable from facial dynamics and amenable to direct editing.
  • Learning a static 3D head representation, decoupled from a full-image 2D background layer, to enable realistic occlusion-aware rendering and inpainting for arbitrary head rotations.
  • Animating head geometry by modifying splat positions via an audio-conditioned dynamic decoder, rather than relying on hand-crafted priors or warping.
  • Employing self-supervised learning without any 3D supervision, utilizing score-distillation from a pretrained diffusion model for extreme views.

This architecture enables feed-forward, photorealistic talking head synthesis with accurate 3D geometry and robust lip motion, using only monocular data and audio inputs.

2. Gaussian-Splat Representation

The explicit 3D basis of Splat-Portrait is a set of anisotropic Gaussian splats, each parameterized by:

  • Center $\mu \in \mathbb{R}^3$
  • Covariance $\Sigma \in \mathbb{R}^{3\times 3}$ (symmetric positive-definite)
  • Color $c \in [0,1]^3$
  • Opacity/weight $w \in \mathbb{R}^+$

The spatial density contributed by a splat is:

$$\rho(x) = w \cdot \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
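
To make the splat parameterization concrete, the following is a minimal NumPy sketch (not from the paper's codebase; all names are illustrative) that evaluates the density contribution of a single anisotropic splat at a query point:

```python
import numpy as np

def splat_density(x, mu, Sigma, w):
    """Density contributed by one anisotropic Gaussian splat at point x.

    x     : (3,) query point
    mu    : (3,) splat center
    Sigma : (3, 3) symmetric positive-definite covariance
    w     : scalar opacity/weight
    """
    d = x - mu
    # rho(x) = w * exp(-0.5 * d^T Sigma^{-1} d)
    return w * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))

# Example: an elongated (anisotropic) splat stretched along the x-axis
mu = np.array([0.0, 0.0, 0.0])
Sigma = np.diag([0.04, 0.01, 0.01])
print(splat_density(np.array([0.1, 0.0, 0.0]), mu, Sigma, w=0.8))
```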

For rendering, splats are composited via volume rendering along camera rays:

  • The transmittance at depth $t$ is

$$T(t) = \exp\left(-\int_0^t \sigma(r(s))\, ds\right)$$

where $\sigma(x)$ is the sum of all splat densities at $x$.

  • Pixel color is computed as:

$$C = \int_0^\infty T(t)\, \sigma(r(t))\, c(r(t))\, dt$$

In practical usage, the differentiable splat rasterizer (Kerbl et al., 3DGS) projects each 3D Gaussian onto a screen-space elliptical disk, composited in back-to-front order (alpha-blending).
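
As a rough illustration of the per-pixel alpha blending described above (a simplified sketch, not the actual 3DGS CUDA rasterizer), splats already projected and sorted far-to-near can be composited with the standard "over" operator; `alphas` and `colors` are assumed to be the per-splat opacity and color already evaluated at the pixel:

```python
import numpy as np

def composite_pixel(alphas, colors):
    """Back-to-front 'over' compositing of the splats covering one pixel.

    alphas : (N,) per-splat opacity at this pixel, ordered far -> near
    colors : (N, 3) per-splat RGB color
    """
    pixel = np.zeros(3)
    for a, c in zip(alphas, colors):
        # Each nearer splat is blended over whatever has been accumulated behind it.
        pixel = a * c + (1.0 - a) * pixel
    return pixel

# Two splats: a dim far splat behind a brighter near splat
print(composite_pixel(np.array([0.3, 0.6]),
                      np.array([[0.2, 0.2, 0.8], [0.9, 0.1, 0.1]])))
```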

Splat-Portrait decomposes the scene into (1) a static set of splats modeling the head (and hair), and (2) a 2D background image layer predicted by the network to handle regions revealed by rotation and occlusion. During pre-training, splats are encouraged to focus on the head region; the background is inpainted for regions revealed in multi-view rotations.

3. Audio-Driven Lip-Motion Synthesis

Speech-driven animation in Splat-Portrait is achieved by directly editing the geometric configuration of splats:

  • Audio feature extraction:
    • AudioNet processes Wav2Vec2 XLSR-53 features into a 128-dimensional framewise embedding using stacked 1D convolutions (kernel size 5, channels up to 64), followed by LeakyReLU and fully connected layers.
    • AudioAttNet aggregates framewise embeddings into a temporally weighted sum via attention (three 1D convolutions, LeakyReLU, linear + softmax).
  • Time embedding and fusion:
    • Scalar frame delta $\Delta T$ (frame offset from the driving audio) is mapped into a 9-frequency sinusoidal positional encoding:

    $$PE_k(\Delta T) = [\sin(2^k \pi \Delta T),\ \cos(2^k \pi \Delta T)], \quad k = 0 \dots 8$$
    • These 18 dimensions are concatenated with the 128-d audio embedding to form a 146-d vector, which is linearly projected for dynamic decoding (a minimal sketch follows below).

  • Dynamic decoder and splat offsets:

    • A U-Net, sharing skip connections with the static generator, receives splat attributes and applies convolutional FiLM-conditioned layers (affine transforms of the audio+time embedding).
    • At each splat, the decoder predicts a dynamic offset $\Delta_d \in \mathbb{R}^3$; the splat's position at time $T$ becomes $\mu_T = \mu + \Delta_d(A, \Delta T)$.
    • Color and covariance remain static; after applying offsets, splats are rendered for the target frame according to camera parameters.

A plausible implication is that direct dynamic editing decouples geometric and appearance dynamics and avoids artifacts associated with implicit field deformation.
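
A minimal PyTorch sketch of this conditioning path follows, assuming the 128-d audio embedding and the 9-frequency sinusoidal encoding of $\Delta T$ described above; module names and the projection width are assumptions, not taken from a released implementation:

```python
import torch
import torch.nn as nn

def time_encoding(delta_t: torch.Tensor, n_freqs: int = 9) -> torch.Tensor:
    """PE_k(dT) = [sin(2^k * pi * dT), cos(2^k * pi * dT)], k = 0..8  ->  18 dims."""
    k = torch.arange(n_freqs, dtype=delta_t.dtype)
    angles = (2.0 ** k) * torch.pi * delta_t.unsqueeze(-1)            # (B, 9)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (B, 18)

class AudioTimeFusion(nn.Module):
    """Concatenate the 128-d audio embedding with the 18-d time encoding (146-d) and project it."""
    def __init__(self, audio_dim: int = 128, out_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(audio_dim + 18, out_dim)

    def forward(self, audio_emb, delta_t):
        fused = torch.cat([audio_emb, time_encoding(delta_t)], dim=-1)  # (B, 146)
        return self.proj(fused)

cond = AudioTimeFusion()(torch.randn(1, 128), torch.tensor([0.2]))  # conditioning vector

# Applying the decoder's predicted offsets: mu_T = mu + Delta_d(A, dT);
# color and covariance stay static.
mu = torch.randn(1, 65536, 3)       # static splat centers
delta_d = torch.zeros_like(mu)      # offsets predicted by the dynamic decoder (zeros here)
mu_T = mu + delta_d
```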

4. Training Scheme and Loss Functions

Training proceeds in two stages:

  • Stage I (static pre-training):
    • Input: two random frames from the same video ($I_i$, $I_n$) with estimated camera intrinsics/extrinsics (simple 3DMM fit, Li et al.).
    • Predict splat parameters (depth, static offset, opacity, scale, rotation, color) and the 2D background image $B$.
    • Render reconstructions $I^*_i$, $I^*_n$ at both camera poses.
    • Loss:

    $$\mathcal{L}_{\mathrm{static}}(I_i, I_n) = \|I_i - I^*_i\|_2 + \|I_n - I^*_n\|_2 + \lambda\,[\mathrm{LPIPS}(I_i, I^*_i) + \mathrm{LPIPS}(I_n, I^*_n)]$$

    with $\lambda = 0.01$; LPIPS is computed with VGGFace and VGG19 backbones.
    • Additionally, frames are rendered over random colors and over $B$ to encourage segregation of splats and background.
    • Score-distillation sampling loss $\mathcal{L}_{SDS}$ (described below) is applied to extreme viewpoints.
    • Overall static loss:

    $$\mathcal{L}_{\mathrm{total\_static}} = \mathcal{L}_{\mathrm{static}}(I_i, I_n) + \mathcal{L}_{SDS}$$

  • Stage II (audio-conditioned fine-tuning):

    • Static decoder is frozen, dynamic decoder is added.
    • Given source frame $I_i$ (static) and target frame $I_n$ with audio, render:
    • $I^*_i$ with zero dynamic offset,
    • $I^{**}_n$ with predicted $\Delta_d(A, \Delta T)$ offsets.
    • Loss:

    $$\mathcal{L}_{\mathrm{dynamic}} = \|I_i - I^*_i\|_2 + \|I_n - I^{**}_n\|_2 + \lambda\,[\mathrm{LPIPS}(I_i, I^*_i) + \mathrm{LPIPS}(I_n, I^{**}_n)]$$
    • Again, SDS is applied for extreme viewpoints; the overall loss is:

    $$\mathcal{L}_{\mathrm{total\_dynamic}} = \mathcal{L}_{\mathrm{dynamic}}(I_i, I_n) + \mathcal{L}_{SDS}$$

  • Score-distillation sampling (SDS) loss:

    • At each step, render the splat model at a randomly sampled extreme pose (yaw $\pm 45^\circ$, pitch $\pm 12.5^\circ$).
    • Crop and align rendered image to the diffusion model's distribution.
    • Add noise, then reverse-diffuse for one or two steps to produce $x_{\mathrm{denoised}}$.
    • SDS loss: $\mathcal{L}_{SDS} = \|x_{\mathrm{rendered}} - x_{\mathrm{denoised}}\|_2$
    • Backpropagation is restricted to rendered splat parameters.

This suggests that photometric, perceptual, and distillation losses are tightly integrated for identity, viewpoint, and dynamic fidelity.
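
To make the loss assembly concrete, here is a hedged PyTorch-style sketch of how the photometric, perceptual, and distillation terms could be combined; `lpips_fn` stands in for the VGG-based LPIPS module, and the dummy metric at the end is for illustration only:

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(gt_i, pred_i, gt_n, pred_n, lpips_fn, lam: float = 0.01):
    """Shared form of L_static / L_dynamic: L2 photometric term plus lambda * LPIPS, over both frames."""
    l2 = F.mse_loss(pred_i, gt_i) + F.mse_loss(pred_n, gt_n)   # MSE used here as the L2 term
    perceptual = lpips_fn(pred_i, gt_i) + lpips_fn(pred_n, gt_n)
    return l2 + lam * perceptual

def sds_loss(x_rendered, x_denoised):
    """L_SDS = ||x_rendered - x_denoised||_2; the denoised target is detached so gradients
    flow only into the rendered splat parameters."""
    return F.mse_loss(x_rendered, x_denoised.detach())

def total_loss(gt_i, pred_i, gt_n, pred_n, lpips_fn, x_rendered=None, x_denoised=None):
    loss = reconstruction_loss(gt_i, pred_i, gt_n, pred_n, lpips_fn)
    if x_rendered is not None:  # SDS is added only on batches with extreme-view renders
        loss = loss + sds_loss(x_rendered, x_denoised)
    return loss

# Illustration with a dummy perceptual metric (the paper uses VGGFace / VGG19 LPIPS)
dummy_lpips = lambda a, b: torch.abs(a - b).mean()
x = torch.rand(1, 3, 256, 256)
print(total_loss(x, x.clone(), x, x.clone(), dummy_lpips))
```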

5. Implementation Details

Datasets used:

  • HDTF: ~400 videos, 350+ subjects, varied poses, cleaned backgrounds.
  • TalkingHead-1KH: 1,100 identities, up to 10,000 frames per subject, static backgrounds.

Preprocessing consists of:

  • Frames sampled at 25 Hz, audio at 16 kHz, images resized to 256×256 pixels.
  • Camera pose estimation via simple 3DMM fitting.

Architectural components:

  • Static generator: U-Net architecture following Splatter-Image (Szymanowicz et al.), predicts per-pixel splat attributes and background.
  • Dynamic decoder: U-Net with FiLM layers for audio/time conditioning.
  • AudioNet + AudioAttNet, as described.
  • Positional encoding for camera and time inputs.
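
For reference, a FiLM-conditioned convolutional block of the kind the dynamic decoder is described as using might look as follows; this is an illustrative PyTorch sketch, and the channel sizes and names are assumptions:

```python
import torch
import torch.nn as nn

class FiLMConvBlock(nn.Module):
    """Conv block whose features are modulated by an affine transform (gamma, beta)
    predicted from the fused audio + time embedding."""
    def __init__(self, in_ch: int, out_ch: int, cond_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * out_ch)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, cond):
        h = self.conv(x)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)   # (B, C) each
        h = gamma[..., None, None] * h + beta[..., None, None]    # FiLM: channel-wise affine
        return self.act(h)

# Example: modulate 64-channel features with a 128-d audio + time conditioning vector
block = FiLMConvBlock(32, 64, cond_dim=128)
out = block(torch.randn(2, 32, 64, 64), torch.randn(2, 128))
```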

Training schedule:

  • Static pre-training: 1 million iterations, batch size of 2 frame pairs, AdamW with learning rate $2.5\times10^{-5}$ and weight decay $10^{-5}$.
  • Fine-tuning: 200K iterations, same optimizer, on audio-labeled portrait videos.
  • SDS applied with 50% probability per batch.
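
A minimal sketch of the stated optimizer configuration (the model below is a placeholder, not the actual generator):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the static generator / dynamic decoder
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-5, weight_decay=1e-5)
```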

6. Experimental Results

Quantitative Performance

Same-identity comparison:

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | CSIM↑ | FID↓ | LipSync↑ |
|---|---|---|---|---|---|---|
| OTAvatar | 13.85 | 0.488 | 0.432 | 0.559 | 78.98 | 5.908 |
| NeRFFaceSpeech | 13.90 | 0.520 | 0.480 | 0.580 | 64.60 | 4.880 |
| HiDe-NeRF | 21.44 | 0.685 | 0.221 | 0.716 | 28.63 | 5.552 |
| Real3D-Portrait | 22.40 | 0.758 | 0.191 | 0.761 | 35.69 | 6.681 |
| GAGAvatar+ARtalker | 23.08 | 0.786 | 0.182 | 0.753 | 37.89 | 6.580 |
| Splat-Portrait | 23.87 | 0.814 | 0.128 | 0.811 | 25.58 | 6.328 |

Cross-identity comparison:

| Method | CSIM↑ | FID↓ | LipSync↑ |
|---|---|---|---|
| NeRFFaceSpeech | 0.450 | 50.80 | 4.423 |
| OTAvatar | 0.521 | 79.32 | 5.032 |
| HiDe-NeRF | 0.628 | 31.23 | 5.652 |
| Real3D-Portrait | 0.691 | 40.82 | 6.521 |
| GAGAvatar+ARtalker | 0.687 | 35.82 | 6.503 |
| Splat-Portrait | 0.726 | 28.62 | 6.218 |

Qualitative Analysis

  • Splat-Portrait preserves fine-scale geometry (wrinkles, strands, earrings) and achieves artifact-free rendering in profile and three-quarter views.
  • Dynamic splat offsets are temporally aligned with phoneme boundaries, yielding accurate, smooth lip articulation without explicit phoneme supervision.
  • 2D background layer enables plausible inpainting for occluded regions revealed during head rotation, without needing masks.
  • Depth maps generated via explicit splats are sharper and exhibit correct 3D structure, outperforming NeRF-based methods.

7. Limitations and Future Directions

Splat-Portrait exhibits certain limitations:

  • Single-image shape ambiguity: Even with score-distillation, monocular inputs are fundamentally ambiguous, leading to minor artifacts on extreme or highly non-frontal poses.
  • Static background modeling: The background image is static per identity; moving or dynamic backgrounds are not supported.
  • Emotion modeling: The dynamic decoder models only lip offsets driven by audio, lacking control over broader facial expressions such as brow or cheek movements.
  • Camera prior dependence: Reliable camera intrinsics and extrinsics (from simplified 3DMM fit) are needed; severe calibration errors compromise reconstruction.

Potential future research avenues include:

  • Joint audio-video training for comprehensive expression transfer.
  • Extension to 4D Gaussian splats allowing full spatiotemporal facial expression modeling.
  • End-to-end integrated pose estimation.
  • Enabling the dynamic decoder to control wider expressive or stylistic features.

Splat-Portrait establishes a reference for explicit 3D point-based, self-supervised, speech-driven head synthesis, demonstrating high identity fidelity, real-time renderability, and photorealistic novel-view capability from a single image (Shi et al., 26 Jan 2026).
