Dynamic Gaussian Avatars

Updated 17 April 2026

Dynamic Gaussian Avatars are explicit 3D models that use anisotropic Gaussian primitives defined by position, covariance, opacity, and view-dependent features to render digital humans.
They combine pose-guided skinning with nonrigid MLP-driven deformations to achieve high-fidelity, real-time animation and novel-view synthesis.
Their training leverages photometric, perceptual, and geometric regularizations, surpassing NeRF-based methods in rendering sharpness and computational efficiency.

Dynamic Gaussian Avatars are a class of explicit 3D representations that employ collections of anisotropic Gaussian primitives, parametrized by position, shape, orientation, opacity, and often view-dependent appearance, to model, animate, and render photorealistic digital humans and human heads in motion. By coupling these primitives to pose-parametric models (e.g., SMPL, FLAME) and/or local deformation fields, Dynamic Gaussian Avatar methods achieve real-time, high-fidelity, and physically plausible animation and novel-view synthesis, along with efficient storage or streaming. They are reported to surpass neural radiance field (NeRF)-based avatars in rendering efficiency and reconstruction sharpness, and now underpin state-of-the-art digital human pipelines.

1. Gaussian Splatting Representation and Animation

Dynamic Gaussian Avatars are built from a set of 3D anisotropic Gaussian functions, each described by at minimum a center $\mu_i\in\mathbb{R}^3$, a covariance $\Sigma_i\in\mathbb{R}^{3\times3}$ (often factorized into a scale vector $s_i$ and rotation $R_i$), an opacity $\alpha_i\in[0,1]$, and a vector of appearance parameters (e.g., spherical harmonics coefficients $f_i$ for view-dependent color) (Li et al., 24 Feb 2025). The density for a single primitive is

$\phi_i(x) = \alpha_i\,\exp(-\tfrac{1}{2}[x-\mu_i]^\top\Sigma_i^{-1}[x-\mu_i]).$

An avatar is the union of such Gaussians; surface and appearance details are encoded explicitly through their spatial and visual parameters.
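The factorization of $\Sigma_i$ into a scale vector and a rotation can be illustrated with a short sketch. It assumes the rotation is stored as a unit quaternion in (w, x, y, z) order and that the covariance is assembled as $R_i\,\mathrm{diag}(s_i)^2\,R_i^\top$, as is common in Gaussian splatting implementations; the function names are hypothetical and not taken from any cited method.

```python
import math

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (assumed convention)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n  # normalize so R is a proper rotation
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(scale, quat):
    """Assemble Sigma = R diag(scale)^2 R^T, which is symmetric by construction."""
    R = quat_to_rotmat(quat)
    return [
        [sum(R[i][k] * scale[k] ** 2 * R[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]

# Usage: an axis-aligned Gaussian, then the same Gaussian rotated 90 degrees about z.
sigma_axis = covariance([1.0, 2.0, 3.0], (1.0, 0.0, 0.0, 0.0))
sigma_rot = covariance([1.0, 2.0, 3.0], (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)))
```

Optimizing the scale and rotation rather than the nine entries of $\Sigma_i$ directly keeps the covariance symmetric and positive semi-definite throughout training.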

Animation of these avatars is achieved by deforming the canonical (often “T-pose”) Gaussians into posed space using skinning or more general deformation fields. The deformation can consist of rigid transformations via Linear Blend Skinning (LBS) with SMPL/FLAME (i.e., using joint transformations and per-Gaussian skinning weights), as well as additional nonrigid corrections predicted by local multi-layer perceptrons (MLPs) conditioned on pose and/or local appearance (Li et al., 24 Feb 2025, Chen et al., 2024, Hu et al., 2023).
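As a minimal sketch of the rigid part of this deformation, the following code applies Linear Blend Skinning to one Gaussian. It assumes each joint transform is given as a 3x3 rotation plus a translation, that the blended linear part is applied directly to the covariance, and that the nonrigid MLP correction is omitted; `skin_gaussian` and its helpers are hypothetical names, and individual methods differ in how they treat the blended matrix.

```python
def matmul3(A, B):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose3(A):
    """Transpose a 3x3 matrix."""
    return [[A[j][i] for j in range(3)] for i in range(3)]

def skin_gaussian(mu, cov, weights, rotations, translations):
    """Move one canonical Gaussian into posed space with Linear Blend Skinning.

    weights: per-joint skinning weights for this Gaussian (assumed to sum to 1).
    rotations / translations: per-joint rigid transforms (3x3 matrix, 3-vector).
    """
    # Blend the per-joint transforms with the skinning weights.
    A = [[sum(w * R[i][j] for w, R in zip(weights, rotations)) for j in range(3)]
         for i in range(3)]
    b = [sum(w * t[i] for w, t in zip(weights, translations)) for i in range(3)]
    # The center moves as a point; the covariance transforms as A Sigma A^T.
    mu_posed = [sum(A[i][k] * mu[k] for k in range(3)) + b[i] for i in range(3)]
    cov_posed = matmul3(matmul3(A, cov), transpose3(A))
    return mu_posed, cov_posed

# Usage: one joint rotating 90 degrees about z and translating by one unit along x.
rot_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
canonical_cov = [[1.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 9.0]]
mu_posed, cov_posed = skin_gaussian([1.0, 0.0, 0.0], canonical_cov,
                                    [1.0], [rot_z90], [[1.0, 0.0, 0.0]])
```

Because a weighted sum of rotations is generally not itself a rotation, some pipelines re-orthogonalize the blended matrix or blend rotations in another parameterization; the sketch leaves that step out.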

2. Deformation Frameworks and Pose Control

A central challenge is realistically mapping each Gaussian from the canonical (template) pose to arbitrary target poses or expressions. Frameworks differ in how they decompose this mapping:

Pose-Guided Deformation: Each Gaussian is first nonrigidly adjusted via an MLP that takes both the canonical position and its transformed proxy position from the pose-parametric mesh and outputs offsets to center, rotation, and scale; this is followed by LBS skinning driven by nearest mesh vertices and adjacent joints (Li et al., 24 Feb 2025, Chen et al., 2024).
Mesh-Aligned Coordination: In SAGA, Gaussians are either strictly bound to mesh faces by barycentric coordinates (stage 1 “adhered”) or allowed to detach with soft regularization (stage 2 “detached”), enabling the model to balance geometric fidelity with expressive power (Chen et al., 2024).
Per-Gaussian Latent Codes: Expressive avatars such as NPGA assign each Gaussian a small learnable code, conditioning both forward deformation and residual MLP corrections; regularization (graph Laplacians) is required to prevent overly local “drift” (Giebenhain et al., 2024).
Patch-based or Hierarchical Parameterizations: In ScaffoldAvatar, Gaussian dynamics are synthesized from local patch expressions (via a geometric patch blendshape model), with patch-level MLPs translating local motion codes into anchor-based Gaussian deformations (Aneja et al., 14 Jul 2025).

These frameworks ensure that dynamics are both surface-consistent (matching anatomical deformations) and capable of reconstructing high-frequency appearance (cloth folding, facial micro-expressions).
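The mesh-aligned binding described above can be sketched as follows. This is an illustrative reading only, not SAGA's implementation: it assumes a Gaussian center is stored as barycentric coordinates on one triangle plus an optional free offset, and that the detached stage penalizes the squared offset length. The names `bound_center` and `detach_penalty` are hypothetical.

```python
def bound_center(bary, v0, v1, v2, offset=(0.0, 0.0, 0.0)):
    """Place a Gaussian center on a mesh triangle via barycentric coordinates.

    With a zero offset the Gaussian adheres to the face; a nonzero offset lets
    it detach from the surface.
    """
    return [bary[0] * v0[i] + bary[1] * v1[i] + bary[2] * v2[i] + offset[i]
            for i in range(3)]

def detach_penalty(offset):
    """Soft regularizer (assumed form): squared distance from the bound position."""
    return sum(o * o for o in offset)

# Usage: the same barycentric coordinates follow the triangle as the mesh is posed.
bary = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
rest = bound_center(bary, [0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0])
posed = bound_center(bary, [0.0, 0.0, 1.0], [3.0, 0.0, 1.0], [0.0, 3.0, 1.0])
detached = bound_center(bary, [0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0],
                        offset=(0.0, 0.0, 0.5))
```

Storing barycentric coordinates instead of absolute positions means the Gaussian inherits the mesh motion for free, while the penalty term controls how far the detached stage may depart from it.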

3. Optimization Objectives, Regularization, and Densification

Avatar training is based on photometric and structural supervision involving multiple loss terms:

Photometric Loss: A weighted combination of $L_1$ and SSIM terms between rendered and ground-truth RGB images (Li et al., 24 Feb 2025, Dongye et al., 2024).
Perceptual Loss: LPIPS on rendered versus ground-truth images is often included for perceptual sharpness and realism (Li et al., 24 Feb 2025, Chen et al., 2024, Lee et al., 30 Mar 2026).
Geometric Regularization: Local-isometry or Laplacian losses to maintain smoothness and coherence across deformed Gaussians (Li et al., 24 Feb 2025, Giebenhain et al., 2024, Chen et al., 2024).
Positional and Scale Constraints: Explicit geometric terms penalizing Gaussians that drift too far from mesh surfaces or grow too large (Qian et al., 2023).
Specialized Regularizers: In patch-driven or per-latent code approaches, region- or code-sparsity regularization prevents oversaturation (Wang et al., 21 Apr 2025).
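A rough sketch of the photometric term is given below. It assumes images are flattened lists of intensities in [0, 1], uses a single global SSIM window instead of the sliding local windows used in practice, and takes a mixing weight of 0.2, a common default in Gaussian splatting code that the cited works may not share. All function names are hypothetical.

```python
def l1_loss(rendered, target):
    """Mean absolute difference between two flattened images."""
    return sum(abs(a - b) for a, b in zip(rendered, target)) / len(rendered)

def global_ssim(rendered, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM computed over the whole image as one window (a simplification)."""
    n = len(rendered)
    mu_a = sum(rendered) / n
    mu_b = sum(target) / n
    var_a = sum((a - mu_a) ** 2 for a in rendered) / n
    var_b = sum((b - mu_b) ** 2 for b in target) / n
    cov = sum((a - mu_a) * (b - mu_b) for a, b in zip(rendered, target)) / n
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def photometric_loss(rendered, target, lam=0.2):
    """(1 - lam) * L1 + lam * (1 - SSIM); lam = 0.2 is an assumed default."""
    return (1 - lam) * l1_loss(rendered, target) + lam * (1 - global_ssim(rendered, target))

# Usage: a slightly perturbed rendering of a four-pixel target image.
target_img = [0.1, 0.5, 0.9, 0.5]
loss = photometric_loss([0.0, 0.5, 1.0, 0.5], target_img)
```

In a full training objective this term would be summed with the perceptual and geometric terms listed above, each with its own weight.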

In high-fidelity avatars, surface details are further enhanced by densification procedures: image-space gradients back-propagated through the renderer identify the Gaussians covering high-error regions, which are then adaptively split into smaller Gaussians so that those regions are covered by more primitives (Li et al., 24 Feb 2025, Li et al., 8 May 2025). Selective densification focused on semantically important regions (e.g., face, hands) is critical for balancing frame rate with detail (Dongye et al., 2024).
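The splitting step can be sketched under several simplifying assumptions: each Gaussian carries an accumulated gradient magnitude, rotation is ignored so the split happens along a coordinate axis, children are placed deterministically rather than sampled, and scales shrink by a factor of 1.6 (a value used in common Gaussian splatting code, not necessarily by the cited methods). The dictionary layout and `densify_by_split` are hypothetical.

```python
def densify_by_split(gaussians, grad_threshold, shrink=1.6):
    """Split every Gaussian whose accumulated gradient exceeds the threshold.

    Each Gaussian is a dict with 'center', 'scale' (3 values each) and 'grad'.
    Rotation is ignored in this sketch, so the split is axis-aligned.
    """
    result = []
    for g in gaussians:
        if g["grad"] <= grad_threshold:
            result.append(g)  # low-error Gaussian: keep unchanged
            continue
        axis = max(range(3), key=lambda k: g["scale"][k])  # split along the longest axis
        child_scale = [s / shrink for s in g["scale"]]
        for sign in (-1.0, 1.0):
            center = list(g["center"])
            center[axis] += sign * 0.5 * g["scale"][axis]  # assumed child placement
            result.append({"center": center, "scale": child_scale, "grad": 0.0})
    return result

# Usage: only the first Gaussian exceeds the threshold and is replaced by two children.
cloud = [
    {"center": [0.0, 0.0, 0.0], "scale": [0.4, 0.1, 0.1], "grad": 0.5},
    {"center": [1.0, 0.0, 0.0], "scale": [0.1, 0.1, 0.1], "grad": 0.01},
]
dense = densify_by_split(cloud, grad_threshold=0.1)
```

Selective densification could be layered on top by using a lower threshold for Gaussians attached to regions such as the face or hands.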

4. Streaming, Compression, and Level-of-Detail Control

Efficient rendering and transmission are addressed through hierarchical and decomposed representations:

Hierarchical Levels of Detail (LoD): LoDAvatar constructs avatars as sequences of increasingly fine Gaussian sets, allowing rendering at coarse or fine levels depending on runtime constraints; selective per-region refinement further optimizes resource use (Dongye et al., 2024).
Layer-Wise Compression: HGC-Avatar disentangles avatar encoding into motion (SMPL-X) and structure (network predicting Gaussians from pose maps), enabling layer-wise compression and progressive decoding (Tang et al., 18 Oct 2025).
Tensorial and Latent Factorizations: Compact tensorial designs store static appearance in tri-planes and dynamic appearance in 1D feature lines, significantly reducing RAM and storage requirements while supporting real-time animation (Wang et al., 21 Apr 2025).

Reported results indicate that adaptive LoD maintains high-quality rendering while the frame rate falls from about 80 FPS to about 10 FPS as the Gaussian count grows from 75k to more than 1M; subjective evaluations indicate that realism is preserved even at the closest viewing ranges (Dongye et al., 2024).
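A runtime might choose among such levels with a simple rule like the one sketched here. The selection policy, the `select_lod` name, and the level table are illustrative assumptions: only the two endpoints echo the figures quoted above, and the intermediate entry is invented for the example.

```python
def select_lod(levels, target_fps):
    """Pick the finest level whose measured frame rate meets the target.

    levels: list of (num_gaussians, measured_fps) tuples.
    Falls back to the coarsest level when no level is fast enough.
    """
    feasible = [lv for lv in levels if lv[1] >= target_fps]
    if not feasible:
        return min(levels, key=lambda lv: lv[0])
    return max(feasible, key=lambda lv: lv[0])

# Illustrative table: (Gaussian count, frames per second); the middle row is hypothetical.
lod_table = [(75_000, 80.0), (300_000, 40.0), (1_000_000, 10.0)]
chosen = select_lod(lod_table, target_fps=30.0)
```

A distance-based variant would work the same way, replacing the frame-rate test with a threshold on the avatar's projected screen size.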

5. Applications, Extensions, and Empirical Benchmarks

Dynamic Gaussian Avatars are used for:

Facial and Full-Body Animation: Enabling photorealistic reenactment, cross-identity transfer, and expressive animation, supporting arbitrary input poses and expressions tracked from monocular or multi-view input (Ji et al., 20 Jan 2026, Teotia et al., 2024, Lee et al., 30 Mar 2026).
Efficient Streaming/Cloud Rendering: Models such as HGC-Avatar can transmit compressed, low-bitrate Gaussian avatars for real-time streaming on edge devices, with up to 100× compression versus generic 3DGS (Tang et al., 18 Oct 2025).
Physically-based Hair Dynamics: Extended hybrid models attach Gaussians to simulated hair strands as well as head mesh, supporting physically plausible hair motion, strand-level editing, and domain-specific color transfer (Kabadayi et al., 7 Apr 2026).
Ultra-High-Resolution Telepresence: ScaffoldAvatar achieves high-fidelity, real-time avatars at 3K image resolution with photorealistic microfeatures, such as wrinkles and pores, leveraging patch expressions and color-based densification (Aneja et al., 14 Jul 2025).

Empirical comparisons consistently show that Dynamic Gaussian Avatar methods match or exceed NeRF-based avatars in PSNR/SSIM/LPIPS at a fraction of the training or inference time (Li et al., 24 Feb 2025, Hu et al., 2023, Jiang et al., 25 Oct 2025). State-of-the-art frameworks achieve >30 dB PSNR, >0.95 SSIM, and LPIPS ≤ 0.04 on standard benchmarks, with real-time (30–120 FPS) rendering and sub-hour training for most models (Li et al., 24 Feb 2025, Chen et al., 2024, Lee et al., 30 Mar 2026).
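For reference, PSNR follows directly from the mean squared error between a rendered image and its ground truth. The sketch below assumes flattened images with intensities in [0, 1]; the function name is hypothetical.

```python
import math

def psnr(rendered, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(rendered, target)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_val ** 2 / mse)

# Usage: a uniform error of 0.1 on a [0, 1] intensity scale.
value = psnr([0.1, 0.1, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0])
```

On this scale, the 30 dB level quoted above corresponds to a mean squared error of 0.001.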

6. Limitations and Outlook

Several challenges remain for Dynamic Gaussian Avatars:

Quality depends on the fidelity of mesh fitting: errors in SMPL/FLAME fitting propagate to surface alignment and pose tracking.
Generalization to loose clothing, occlusions, or extreme nonrigid motions (e.g., cloth/ear flapping) is not fully solved (Chen et al., 2024, Jiang et al., 25 Oct 2025).
Although some frameworks support multi-human or freeform scenes (Liu et al., 2023), most current pipelines remain single-subject.
Hardware and memory requirements scale with Gaussian count; memory-efficient or mobile-focused variants are a target of ongoing research (Tang et al., 18 Oct 2025, Wang et al., 21 Apr 2025).

Continued innovation in Gaussian-based avatar modeling is likely to further close the gap to real-time, high-fidelity, and widely accessible digital human representations for AR/VR, telepresence, and entertainment applications.

References: