
Drivable Gaussian Avatars

Updated 3 July 2026
  • Drivable Gaussian avatars are photorealistic 3D models constructed from spatially-anchored Gaussian splats that deform with pose, expression, and motion signals.
  • They employ advanced techniques such as UV-space decoding, mesh or cage-based anchoring, and neural-field deformations to achieve faithful free-view synthesis and expression transfer.
  • These avatars enable real-time animation in VR, telepresence, and XR systems through hierarchical, resource-efficient designs and adaptive rendering pipelines.

A drivable Gaussian avatar is a photorealistic 3D representation of a person (typically the head or full body) built from a set of mesh-anchored or otherwise spatially anchored 3D Gaussian splats whose parameters are explicitly controlled and deformed by pose, expression, or motion signals. The category includes avatars built from single images, from sparse or dense multi-view video, and from synthetic priors. These avatars are engineered for faithful free-view synthesis, expression transfer, and real-time animation in virtual reality, telepresence, and extended-reality systems. Key advances include hierarchical UV-space decoding, mesh or cage-based geometric anchoring, neural and hybrid CNN/analytic deformation fields, progressive and mobile-adapted codecs, and neural-field volumetric supervision. This article surveys major architectures, underlying mathematical frameworks, representative pipelines, and evaluation benchmarks in the field, as reported in recent arXiv literature.

1. Gaussian Splatting for Avatar Representation

Modern drivable avatars parameterize human appearance and geometry by collections of 3D Gaussian primitives (“splats”), each defined by a center $\mu_i\in\mathbb{R}^3$, a covariance $\Sigma_i\in\mathbb{R}^{3\times3}$ (parameterized via anisotropic scale and rotation), RGB (or spherical harmonics) color $c_i$, and per-Gaussian opacity or density $\alpha_i$. The density at a point $x$ is $G_i(x)=\alpha_i \exp\left(-\frac{1}{2} (x-\mu_i)^\top \Sigma_i^{-1} (x-\mu_i)\right)$. Rendering projects each primitive into the camera view, accumulates their elliptical screen-space contributions, and composites them through ordered alpha blending or volumetric integration, supporting explicit depth, multi-view, and occlusion reasoning (Guo et al., 19 Apr 2025, Lee et al., 24 Dec 2025, Giebenhain et al., 2024, Qian et al., 2023, Junkawitsch et al., 21 May 2025).
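The density formula and the ordered alpha blending described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any cited renderer: the function names are hypothetical, the covariance factorization $\Sigma = R S S^\top R^\top$ is the usual scale-and-rotation parameterization, and the compositing step assumes the per-pixel contributions have already been projected and sorted nearest-first.

```python
import numpy as np

def covariance_from_scale_rotation(scale, R):
    """Build Sigma = R S S^T R^T from per-axis scales (3,) and a rotation (3, 3)."""
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

def gaussian_density(x, mu, cov, alpha):
    """Evaluate G(x) = alpha * exp(-0.5 * (x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return alpha * np.exp(-0.5 * d @ np.linalg.solve(cov, d))

def composite_front_to_back(colors, alphas):
    """Ordered alpha blending: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j).

    `colors` and `alphas` must already be sorted nearest-first.
    Returns the blended color and the remaining transmittance.
    """
    out = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        out += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= 1.0 - a
    return out, transmittance
```

At the center of a splat the exponent vanishes, so `gaussian_density` returns exactly `alpha`; a fully opaque splat drives the transmittance to zero and hides everything behind it.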

Anchoring is typically geometric: Gaussians are placed at triangle barycenters, texels in a mesh-UV atlas, or off-surface in canonical volumetric fields. Methods vary in whether they use triangle-local coordinates (as in GaussianAvatars (Qian et al., 2023)), per-texel attachment (UV Gaussians (Jiang et al., 2024), SEGA (Guo et al., 19 Apr 2025), TexAvatars (Lee et al., 24 Dec 2025)), or canonical, decoupled anchor points (PiG-Avatar (Kaltheuner et al., 19 May 2026)).
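As a rough sketch of triangle-local anchoring, the snippet below builds a local frame on a triangle and maps a Gaussian center from that frame into world space. The names `triangle_frame` and `local_to_world` are hypothetical, and using the mean edge length as the scale is an illustrative assumption; the cited methods may define the frame and scale differently.

```python
import numpy as np

def triangle_frame(v0, v1, v2):
    """Return (origin, R, scale) for a local frame attached to a triangle.

    origin: barycenter of the triangle
    R:      3x3 rotation whose columns are the edge direction, the in-plane
            perpendicular, and the face normal
    scale:  mean edge length (an illustrative choice)
    """
    origin = (v0 + v1 + v2) / 3.0
    e0 = v1 - v0
    x_axis = e0 / np.linalg.norm(e0)
    n = np.cross(e0, v2 - v0)
    z_axis = n / np.linalg.norm(n)
    y_axis = np.cross(z_axis, x_axis)
    R = np.stack([x_axis, y_axis, z_axis], axis=1)
    edges = [v1 - v0, v2 - v1, v0 - v2]
    scale = float(np.mean([np.linalg.norm(e) for e in edges]))
    return origin, R, scale

def local_to_world(mu_local, origin, R, scale):
    """Map a Gaussian center from triangle-local to world coordinates."""
    return scale * (R @ mu_local) + origin
```

Because the frame is recomputed from the posed triangle, any rigid motion of the mesh carries the attached Gaussians along without changing their local parameters.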

2. Avatar Animation and Driving Mechanisms

Drivability refers to explicit, real-time control of avatar geometry and appearance by parametric signals. Control inputs are typically facial blendshapes and head pose (FLAME (Guo et al., 19 Apr 2025, Lee et al., 24 Dec 2025, Qian et al., 2023)), joint angles (SMPL/SMPL-X for body (Zubekhin et al., 8 Apr 2025, Dongye et al., 2024, Jiang et al., 25 Oct 2025)), or low-dimensional neural codes (NPHM (Giebenhain et al., 2024)), delivered from tracking or external drivers.
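One common way to turn joint transforms into Gaussian motion is linear blend skinning (LBS). The sketch below is an assumption-laden illustration, not the driving scheme of any particular cited method: the function names are hypothetical, `joint_transforms` is assumed to hold 4x4 matrices that map the canonical pose to the target pose, and `weights` is assumed to be a row-normalized skinning matrix.

```python
import numpy as np

def lbs_transform(mu, weights, joint_transforms):
    """Apply linear blend skinning to Gaussian centers.

    mu:               (N, 3) canonical centers
    weights:          (N, J) skinning weights, rows sum to 1
    joint_transforms: (J, 4, 4) canonical-to-posed transforms
    Returns posed centers (N, 3) and the blended 3x3 linear part (N, 3, 3).
    """
    T = np.einsum("nj,jab->nab", weights, joint_transforms)
    mu_h = np.concatenate([mu, np.ones((mu.shape[0], 1))], axis=1)
    mu_posed = np.einsum("nab,nb->na", T, mu_h)[:, :3]
    return mu_posed, T[:, :3, :3]

def transform_covariance(cov, A):
    """Carry covariances along with the deformation: Sigma' = A Sigma A^T."""
    return np.einsum("nab,nbc,ndc->nad", A, cov, A)
```

Transforming the covariance with the same linear map keeps each splat's orientation consistent with the surface it follows; blended transforms are not exactly rigid, which is one reason learned corrective terms are often added on top.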

Geometric deformation frameworks include:

Fine-scale, expression-dependent geometric and photometric variations are incorporated by decoupled branches for static (identity-invariant) and dynamic (expression-driven) regions (SEGA dual-branch (Guo et al., 19 Apr 2025)), person-specific finetuning, and per-Gaussian latent codes or neural attribute fields (NPGA (Giebenhain et al., 2024), CAG-Avatar cross-attention (Chang et al., 21 Jan 2026)).

3. Network Architectures and Learning Strategies

Architectures are hybrid, typically combining:

Supervision encompasses photometric (L1, SSIM), perceptual (LPIPS, VGG), landmark, and geometry losses. For high realism, regularization includes Laplacian smoothing on offsets/features (Giebenhain et al., 2024), and physical priors (Neo-Hookean regularization in cage-based D3GA (Zielonka et al., 2023)).
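To make the loss structure concrete, here is a minimal sketch combining a photometric L1 term with a Laplacian smoothness penalty. It assumes the per-Gaussian offsets live on a regular UV grid of shape (H, W, C) and uses a 4-neighbor discrete Laplacian; the weight `lambda_lap` and all function names are hypothetical, and the cited works may regularize on a mesh or nearest-neighbor graph instead.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute photometric error between rendered and target images."""
    return float(np.mean(np.abs(pred - target)))

def laplacian_smoothness(offset_map):
    """Mean squared 4-neighbor Laplacian over interior texels of an (H, W, C) map."""
    center = offset_map[1:-1, 1:-1]
    neighbors = (offset_map[:-2, 1:-1] + offset_map[2:, 1:-1]
                 + offset_map[1:-1, :-2] + offset_map[1:-1, 2:])
    lap = neighbors - 4.0 * center
    return float(np.mean(lap ** 2))

def total_loss(pred, target, offset_map, lambda_lap=0.1):
    """Photometric term plus weighted smoothness regularizer."""
    return l1_loss(pred, target) + lambda_lap * laplacian_smoothness(offset_map)
```

The Laplacian term is zero for constant and linearly varying offset maps, so it penalizes only local roughness and leaves smooth, large-scale deformation free.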

Training data ranges from multi-view facial/body data to synthetic datasets of diverse virtual humans (Zielonka et al., 12 Jan 2025), with scaling to thousands of identities in GIGA (Zubekhin et al., 8 Apr 2025).

4. Hierarchical, Progressive, and Resource-Efficient Design

Level-of-detail (LOD) and progressive techniques are developed to support efficient rendering and streaming, especially for XR and mobile devices:

  • Gaussian hierarchies: Templates are subdivided adaptively, concentrating detail (splat density) on regions with high image gradient (screen-space error) (Song et al., 17 Mar 2026, Dongye et al., 2024). An importance ranking orders the download and activation of splats for progressive refinement.
  • Selective detail enhancement: Refinement is concentrated on face and hand regions, guided by semantic segmentation masks (Dongye et al., 2024).
  • Resource-aware distillation: Linear distillation and corrective sharing compress neural attribute decoders into lightweight linear layers suitable for mobile hardware (e.g., SqueezeMe achieves 0.45 ms per-actor decoding, enabling three full-body avatars at 72 FPS on Meta Quest 3) (Iandola et al., 2024).
  • Hierarchical LOD streaming: Coarse avatars are rendered immediately, with continuous, non-destructive integration of finer splats as network or compute allows (Song et al., 17 Mar 2026).
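The importance-ranked, non-destructive refinement described in the list above can be sketched as a sort followed by a prefix selection. This is a simplified illustration under stated assumptions: `importance` is a hypothetical per-splat score (for example, derived from screen-space error), and `budget` stands in for whatever the network or compute limit currently allows.

```python
import numpy as np

def progressive_order(importance):
    """Indices of splats sorted from most to least important (stable sort)."""
    return np.argsort(-np.asarray(importance, dtype=float), kind="stable")

def active_splats(order, budget):
    """Prefix of the ranked list that fits the current splat budget."""
    k = max(0, min(int(budget), len(order)))
    return order[:k]
```

Because every level is a prefix of the same ranking, raising the budget only adds splats and never removes or replaces ones already rendered, which is what makes the refinement non-destructive.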

5. Benchmarking, Evaluation, and Comparative Analysis

Quality assessments employ standard and avatar-specific metrics:

SEGA reports state-of-the-art results on single-image head avatars (PSNR = 24.9998, SSIM = 0.8246, LPIPS = 0.2305, outperforming prior work), while NPGA raises self-reenactment accuracy by more than 2 dB PSNR over the previous best (Guo et al., 19 Apr 2025, Giebenhain et al., 2024). TexAvatars and CAG-Avatar yield better identity and detail preservation under edge-case deformations, attributed to mesh-aware and cross-attentive driving respectively (Lee et al., 24 Dec 2025, Chang et al., 21 Jan 2026).
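For readers less familiar with the PSNR figures quoted above, the following sketch shows how PSNR is computed and what a gain in decibels means in terms of mean squared error. The function names are hypothetical, and images are assumed to be float arrays scaled to [0, `max_val`].

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def mse_ratio_for_gain(db_gain):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (db_gain / 10.0)
```

Because the scale is logarithmic, a 2 dB improvement corresponds to a mean squared error roughly 1.58 times smaller, that is, a reduction of about 37%.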

6. Applications, Limitations, and Future Directions

Drivable Gaussian avatars underpin a wide range of XR, VR, and AR systems, as well as telepresence, gaming, and digital content creation. Feed-forward pipelines (FiCA (Youwang et al., 23 Jun 2026)) and synthetic-prior few-shot tuning (Zielonka et al., 12 Jan 2025) greatly reduce latency and data requirements, which supports rapid personalization and privacy compliance.

Limitations and challenges include:

  • Generalization: Many systems rely on 3DMM priors or mesh registrations (FLAME/SMPL-X), which limits the representation of non-canonical elements such as loose clothing, complex hair, or accessories.
  • Geometry/texture binding: Surface-based attachment can tether detail to mesh topology; decoupled canonical/volumetric anchors (as in PiG-Avatar) address some, but not all, of these issues (Kaltheuner et al., 19 May 2026).
  • Real-time, multi-actor scalability: Continued work is needed to enable full-scene, high-fidelity, simultaneous multi-avatar rendering at interactive rates on untethered/mobile hardware (Iandola et al., 2024, Song et al., 17 Mar 2026).
  • Semantic controls and relighting: Most avatars encode radiance fields fixed to training illumination; robust relighting and physically-based appearance modeling remain open areas.
  • Dataset bias and fairness: Generalization to diverse populations, ages, and conditions depends on the breadth of multi-view and synthetic training data.

Anticipated directions include integration of physics-based dynamics (for hair/clothes), richer statistical priors, global LOD mesh simplification and splat culling, and hybrid mesh–Gaussian architectures for compositional, semantically-driven animation.

7. Representative Pipelines and Comparative Table

The following table summarizes several key approaches and their principal design elements:

| Approach | Geometry Binding | Gaussian Prediction | Deformation Control | Scalability/Inference FPS |
|---|---|---|---|---|
| SEGA (Guo et al., 19 Apr 2025) | FLAME UV | Dual-branch CNN | Identity/expression latent | 20 FPS (A100), SOTA generalization |
| TexAvatars (Lee et al., 24 Dec 2025) | FLAME, Quasi-Phong Jacobians | CNN (UV), mesh Jacobian | Analytic rig + local CNN | 50 FPS (3090Ti) |
| CAG-Avatar (Chang et al., 21 Jan 2026) | FLAME UV | Cross-attention, Per-Gaussian | Per-splat cross-attention | >50 FPS (4090) |
| GIGA (Zubekhin et al., 8 Apr 2025) | SMPL-X (UV) | MultiHeadUNet | Motion code, pose | -- |
| NPGA (Giebenhain et al., 2024) | NPHM neural field | MLP (per-Gaussian) | Cycle-consistent MLPs | 31–43 FPS (3080) |
| ProgressiveAvatars (Song et al., 17 Mar 2026) | FLAME face-local | Implicit subdivision | Hierarchy, importance rank | 52–159 FPS (5090), progressive |
| SqueezeMe (Iandola et al., 2024) | LBS, UV | Linear, grid-shared | Linear corrective, GCS | 0.45 ms HTP, 72 FPS (Quest 3) |
| FiCA (Youwang et al., 23 Jun 2026) | UPM hypernetwork | Feed-forward mesh/diff | Universal prior, single-image | 66 FPS (A100), full pipeline ~5 s |

Significant improvements between 2023 and 2026 include (1) UV-space and hybrid mesh+CNN architectures, (2) real-time progressive loading, (3) substantial reductions in resource demands through linear distillation and shared correctives, and (4) state-of-the-art detail and generalization in single-image and few-shot settings.


The field continues rapid innovation in geometry/UV representation, learning paradigms, and real-time adaptation, converging toward robust, visually faithful, universally drivable Gaussian avatars.
