Drivable Gaussian Avatars
- Drivable Gaussian avatars are photorealistic 3D models constructed from spatially-anchored Gaussian splats that deform with pose, expression, and motion signals.
- They employ advanced techniques such as UV-space decoding, mesh or cage-based anchoring, and neural-field deformations to achieve faithful free-view synthesis and expression transfer.
- These avatars enable real-time animation in VR, telepresence, and XR systems through hierarchical, resource-efficient designs and adaptive rendering pipelines.
A drivable Gaussian avatar is a photorealistic, 3D representation of a person (typically head or full-body) constructed from a set of meshed or spatially-anchored 3D Gaussian splats, whose parameters are explicitly controllable and deformable by pose, expression, or motion signals. The category includes avatars built from single images, sparse or dense multi-view video, or even synthetic priors. These avatars are engineered for faithful free-view synthesis, advanced expression transfer, and real-time animation in virtual reality, telepresence, and extended-reality systems. Key advances include hierarchical UV-space decoding, mesh or cage-based geometric anchoring, neural and hybrid CNN/analytic deformation fields, progressive and mobile-adapted codecs, and neural-field volumetric supervision. This article surveys major architectures, underlying mathematical frameworks, representative pipelines, and evaluation benchmarks in the field as substantiated in recent arXiv literature.
1. Gaussian Splatting for Avatar Representation
Modern drivable avatars parameterize human appearance and geometry by collections of 3D Gaussian primitives (“splats”), each defined by a center $\mu_i$, a covariance $\Sigma_i$ (parameterized via anisotropic scale and rotation), RGB (or spherical harmonics) color $c_i$, and per-Gaussian opacity or density $\alpha_i$. The density at a point $x$ is $G_i(x) = \exp\left(-\tfrac{1}{2}(x-\mu_i)^\top \Sigma_i^{-1}(x-\mu_i)\right)$. Rendering projects each primitive into the camera view, accumulates their elliptical screen-space contributions, and composites them through ordered alpha blending or volumetric integration, supporting explicit depth, multi-view, and occlusion reasoning (Guo et al., 19 Apr 2025, Lee et al., 24 Dec 2025, Giebenhain et al., 2024, Qian et al., 2023, Junkawitsch et al., 21 May 2025).
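As a rough illustration of this representation, the following NumPy sketch evaluates a single anisotropic Gaussian and blends depth-sorted samples front to back. It assumes the standard 3D Gaussian splatting form, with covariance $\Sigma = R S S^\top R^\top$ and an unnormalized density $\exp(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu))$. The function names are hypothetical, and the camera projection step is omitted; none of this is taken from any cited implementation.

```python
import numpy as np

def covariance_from_scale_rotation(scale, rotation):
    """Build Sigma = R S S^T R^T from per-axis scales (3,) and a rotation matrix (3, 3)."""
    S = np.diag(scale)
    return rotation @ S @ S.T @ rotation.T

def gaussian_density(x, mu, sigma):
    """Unnormalized density exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma, d)))

def composite_front_to_back(colors, alphas):
    """Blend depth-sorted samples (nearest first).

    Accumulates C = sum_i c_i * a_i * prod_{j<i} (1 - a_j) and returns the
    pixel color together with the remaining transmittance.
    """
    pixel = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= 1.0 - a
    return pixel, transmittance

# Usage: an axis-aligned Gaussian and two half-transparent samples.
sigma = covariance_from_scale_rotation(np.array([1.0, 2.0, 3.0]), np.eye(3))
value = gaussian_density(np.array([1.0, 0.0, 0.0]), np.zeros(3), sigma)
pixel, remaining = composite_front_to_back([[1, 0, 0], [0, 1, 0]], [0.5, 0.5])
```

In a full renderer the per-sample alpha would be the learned opacity multiplied by the projected 2D Gaussian value at the pixel; here the alphas are passed in directly to keep the blending rule visible.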
Anchoring is typically geometric: Gaussians are placed at triangle barycenters, texels in a mesh-UV atlas, or off-surface in canonical volumetric fields. Methods vary in whether they use triangle-local coordinates (as in GaussianAvatars (Qian et al., 2023)), per-texel attachment (UV Gaussians (Jiang et al., 2024), SEGA (Guo et al., 19 Apr 2025), TexAvatars (Lee et al., 24 Dec 2025)), or canonical, decoupled anchor points (PiG-Avatar (Kaltheuner et al., 19 May 2026)).
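To make triangle-local anchoring concrete, the sketch below builds an orthonormal frame at a triangle's barycenter and maps a Gaussian center stored in that frame into world space. It is written in the spirit of triangle-local binding rather than as a reproduction of any cited method: the frame construction, the size measure, and the function names are all assumptions made for illustration.

```python
import numpy as np

def triangle_frame(v0, v1, v2):
    """Return (origin, R, scale) for one triangle.

    origin: barycenter of the three vertices.
    R:      3x3 matrix whose columns are the local x, y, z axes
            (first edge direction, in-plane perpendicular, normal).
    scale:  hypothetical size measure, the mean of two edge lengths.
    """
    origin = (v0 + v1 + v2) / 3.0
    e1, e2 = v1 - v0, v2 - v0
    normal = np.cross(e1, e2)
    x = e1 / np.linalg.norm(e1)
    z = normal / np.linalg.norm(normal)
    y = np.cross(z, x)
    R = np.stack([x, y, z], axis=1)
    scale = 0.5 * (np.linalg.norm(e1) + np.linalg.norm(e2))
    return origin, R, scale

def local_to_world(local_mu, triangle):
    """Map a Gaussian center from triangle-local to world coordinates."""
    origin, R, scale = triangle_frame(*triangle)
    return origin + scale * (R @ local_mu)

# Usage: a splat hovering slightly above a unit right triangle.
tri = (np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
world_mu = local_to_world(np.array([0.2, 0.0, 0.1]), tri)
```

Because the local coordinates are fixed while the frame is recomputed from the current vertex positions, the splat follows any rigid motion of its parent triangle without further learning.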
2. Avatar Animation and Driving Mechanisms
Drivability refers to explicit, real-time control of avatar geometry and appearance by parametric signals. Control inputs are typically facial blendshapes and head pose (FLAME (Guo et al., 19 Apr 2025, Lee et al., 24 Dec 2025, Qian et al., 2023)), joint angles (SMPL/SMPL-X for body (Zubekhin et al., 8 Apr 2025, Dongye et al., 2024, Jiang et al., 25 Oct 2025)), or low-dimensional neural codes (NPHM (Giebenhain et al., 2024)), delivered from tracking or external drivers.
Geometric deformation frameworks include:
- Analytic mesh rigging: Gaussians ride on mesh triangles whose positions are given by parametric morphable models or skeleton-based skinning (Qian et al., 2023, Jiang et al., 2024, Dongye et al., 2024, Zubekhin et al., 8 Apr 2025).
- UV-to-3D hybrid lifting: Predict local attributes per texel (CNNs in UV space), then use mesh-aware Jacobians or barycentric frames to transfer to global 3D, supporting smoothness and semantic continuity (Lee et al., 24 Dec 2025, Guo et al., 19 Apr 2025).
- Forward neural fields: Independently learn canonical-to-posed deformations by distilling neural parametric models into MLPs with cycle-consistency constraints (Giebenhain et al., 2024).
- Volumetric barycentric transport: Canonical anchors are transported by barycentric projection and local frame alignment to handle off-surface, layered, or loose clothing geometries, as in PiG-Avatar (Kaltheuner et al., 19 May 2026).
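The skeleton-based skinning mentioned under analytic mesh rigging can be sketched as linear blend skinning applied directly to Gaussian parameters. The code below is a minimal illustration, assuming per-Gaussian skinning weights and 4x4 bone transforms are already available; the function name and array layout are hypothetical and do not correspond to any cited codebase.

```python
import numpy as np

def skin_gaussians(mu, sigma, weights, bone_transforms):
    """Pose Gaussians with linear blend skinning.

    mu:              (N, 3) canonical centers
    sigma:           (N, 3, 3) canonical covariances
    weights:         (N, B) skinning weights, each row summing to 1
    bone_transforms: (B, 4, 4) canonical-to-posed bone transforms
    """
    # Blend the bone transforms per Gaussian.
    blended = np.einsum("nb,bij->nij", weights, bone_transforms)
    A = blended[:, :3, :3]   # linear part
    t = blended[:, :3, 3]    # translation part
    mu_posed = np.einsum("nij,nj->ni", A, mu) + t
    # Covariances transform as A Sigma A^T under a linear map.
    sigma_posed = A @ sigma @ np.transpose(A, (0, 2, 1))
    return mu_posed, sigma_posed

# Usage: two Gaussians, three bones (identity, x-translation, 90-degree z-rotation).
rot_z = np.array([[0, -1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
shift_x = np.eye(4)
shift_x[0, 3] = 1.0
bones = np.stack([np.eye(4), shift_x, rot_z])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
sigma = np.stack([np.eye(3), np.diag([4.0, 1.0, 1.0])])
weights = np.array([[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]])
mu_posed, sigma_posed = skin_gaussians(mu, sigma, weights, bones)
```

A blend of several rotations is generally not itself a rotation, so practical systems may re-orthogonalize the linear part or skin the rotation separately; this sketch keeps the plain blended matrix for brevity.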
Fine-scale, expression-dependent geometric and photometric variations are incorporated by decoupled branches for static (identity-invariant) and dynamic (expression-driven) regions (SEGA dual-branch (Guo et al., 19 Apr 2025)), person-specific finetuning, and per-Gaussian latent codes or neural attribute fields (NPGA (Giebenhain et al., 2024), CAG-Avatar cross-attention (Chang et al., 21 Jan 2026)).
3. Network Architectures and Learning Strategies
Architectures are hybrid, typically combining:
- UV-space 2D CNNs or U-Nets: Predict per-texel splat parameters conditioned on identity, pose, and expression (Guo et al., 19 Apr 2025, Lee et al., 24 Dec 2025, Jiang et al., 2024).
- Multi-branch designs: Separate decoders for static and dynamic regions, as in SEGA's static and dynamic branches (Guo et al., 19 Apr 2025), or EVA’s decoupled body/head U-Nets (Junkawitsch et al., 21 May 2025).
- Latent code integration: Person-specific or per-part latent codes (ID and expression, part-aware upsampling (Zielonka et al., 12 Jan 2025)), local per-splat features (Giebenhain et al., 2024), and learnable appearance latents driven by spatial MLPs with autoregressive predictors (Steiner et al., 1 Apr 2026).
- Neural fields: Multi-resolution hash grids or triplanes as continuous volumetric feature fields, supplying appearance and offset information to Gaussian anchoring points (Kaltheuner et al., 19 May 2026, Giebenhain et al., 2024, Yuan et al., 2023).
- Cross-attention modules: Conditionally Adaptive Gaussian Avatars employ cross-attention for per-region driving signal selection, improving local detail reproduction (Chang et al., 21 Jan 2026).
Supervision encompasses photometric (L1, SSIM), perceptual (LPIPS, VGG), landmark, and geometry losses. For high realism, regularization includes Laplacian smoothing on offsets/features (Giebenhain et al., 2024), and physical priors (Neo-Hookean regularization in cage-based D3GA (Zielonka et al., 2023)).
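A simplified version of such an objective combines an L1 photometric term with a Laplacian smoothness penalty. The sketch below assumes the offsets live on a regular per-texel grid and uses a 4-neighbor discrete Laplacian; this grid layout, the loss weights, and the function names are illustrative assumptions, and the SSIM, perceptual, and landmark terms are left out because they need additional libraries or models.

```python
import numpy as np

def l1_photometric(rendered, target):
    """Mean absolute error between rendered and ground-truth images."""
    return float(np.mean(np.abs(rendered - target)))

def laplacian_penalty(offsets):
    """Mean squared deviation of each interior texel from its 4-neighbor mean.

    offsets: (H, W, C) per-texel offset or feature map.
    """
    center = offsets[1:-1, 1:-1]
    neighbors = (
        offsets[:-2, 1:-1] + offsets[2:, 1:-1]
        + offsets[1:-1, :-2] + offsets[1:-1, 2:]
    ) / 4.0
    return float(np.mean((center - neighbors) ** 2))

def total_loss(rendered, target, offsets, w_photo=1.0, w_lap=0.1):
    """Weighted sum of the two terms; the weights are placeholder values."""
    return w_photo * l1_photometric(rendered, target) + w_lap * laplacian_penalty(offsets)

# Usage: a uniform image error plus a single spike in the offset map.
rendered = np.full((2, 2, 3), 0.5)
target = np.zeros((2, 2, 3))
offsets = np.zeros((3, 3, 1))
offsets[1, 1, 0] = 1.0
loss = total_loss(rendered, target, offsets)
```

The penalty vanishes for constant or linearly varying maps and grows with isolated spikes, which is the behavior a smoothness regularizer is meant to encourage.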
Training data ranges from multi-view facial/body data to synthetic datasets of diverse virtual humans (Zielonka et al., 12 Jan 2025), with scaling to thousands of identities in GIGA (Zubekhin et al., 8 Apr 2025).
4. Hierarchical, Progressive, and Resource-Efficient Design
Level-of-detail (LOD) and progressive techniques are developed to support efficient rendering and streaming, especially for XR and mobile devices:
- Gaussian hierarchies: Templates are subdivided adaptively, concentrating detail (splat density) on regions with high image gradient (screen-space error) (Song et al., 17 Mar 2026, Dongye et al., 2024). An importance ranking orders the download and activation of splats for progressive refinement.
- Selective detail enhancement: Refinement is concentrated on face and hand regions, guided by semantic segmentation masks (Dongye et al., 2024).
- Resource-aware distillation: Linear distillation and corrective sharing compress neural attribute decoders into lightweight linear layers suitable for mobile hardware (e.g., SqueezeMe achieves 0.45 ms per-actor decoding, enabling three full-body avatars at 72 FPS on Meta Quest 3) (Iandola et al., 2024).
- Hierarchical LOD streaming: Coarse avatars are rendered immediately, with continuous, non-destructive integration of finer splats as network or compute allows (Song et al., 17 Mar 2026).
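The idea of importance-ordered, non-destructive refinement can be sketched with a flat ranking: splats are sorted once by an importance score, and the renderer always draws a prefix of that order. This is a deliberate simplification of the hierarchical schemes described above, and the scoring, the function names, and the `coarse_count` parameter are hypothetical.

```python
import numpy as np

def progressive_order(importance):
    """Indices of splats sorted from most to least important."""
    return np.argsort(-np.asarray(importance, dtype=float), kind="stable")

def active_splats(order, coarse_count, received):
    """Splats renderable right now.

    The coarse base level (first `coarse_count` entries) is always drawn;
    refinements are appended as more splats arrive, so each set extends
    the previous one instead of replacing it.
    """
    count = max(coarse_count, min(received, len(order)))
    return order[:count]

# Usage: four splats, a two-splat base level, then one refinement arrives.
order = progressive_order([0.1, 0.9, 0.5, 0.7])
base = active_splats(order, coarse_count=2, received=0)
refined = active_splats(order, coarse_count=2, received=3)
```

Because every refined set contains the coarser one as a prefix, the avatar can be displayed immediately and sharpened as bandwidth or compute allows.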
5. Benchmarking, Evaluation, and Comparative Analysis
Quality assessments employ standard and avatar-specific metrics:
- Photometric quality: Peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and perceptual LPIPS, measured on held-out views and unseen identities/expressions (Guo et al., 19 Apr 2025, Giebenhain et al., 2024, Junkawitsch et al., 21 May 2025, Lee et al., 24 Dec 2025, Chang et al., 21 Jan 2026, Zubekhin et al., 8 Apr 2025).
- Generalization: Ability to reenact novel poses, expressions, or transfer expressions across identities (Zielonka et al., 12 Jan 2025, Teotia et al., 2024, Giebenhain et al., 2024).
- Resource efficiency: Frames per second for high-resolution output, and memory/HTP time for mobile or standalone VR (Iandola et al., 2024, Song et al., 17 Mar 2026).
- Qualitative analysis: Multi-identity, multi-expression, and in-the-wild datasets for visual consistency, facial correspondence, and semantic detail (hair, mouth interior, wrinkles).
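Of the photometric metrics listed, PSNR is the simplest to state: it is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ for images with peak value MAX. The sketch below assumes images are floating-point arrays in the range [0, 1]; the function name is illustrative.

```python
import numpy as np

def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    rendered = np.asarray(rendered, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((rendered - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Usage: a uniform error of 0.1 on a [0, 1] image.
score = psnr(np.full((4, 4, 3), 0.1), np.zeros((4, 4, 3)))
```

Since the scale is logarithmic, a gain of 2 dB corresponds to reducing the mean squared error by a factor of about 1.58.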
SEGA reports state-of-the-art results for single-image head avatars (PSNR = 24.9998, SSIM = 0.8246, LPIPS = 0.2305), while NPGA raises self-reenactment accuracy by more than 2 dB PSNR over the previous best (Guo et al., 19 Apr 2025, Giebenhain et al., 2024). TexAvatars and CAG-Avatar yield superior identity and detail preservation under edge-case deformations due to mesh-aware and cross-attentive driving, respectively (Lee et al., 24 Dec 2025, Chang et al., 21 Jan 2026).
6. Applications, Limitations, and Future Directions
Drivable Gaussian avatars underpin systems for XR, VR, AR, telepresence, gaming, and digital content creation. Feed-forward pipelines (FiCA (Youwang et al., 23 Jun 2026)) and synthetic-prior few-shot tuning (Zielonka et al., 12 Jan 2025) greatly reduce latency and data requirements, fostering rapid personalization and privacy compliance.
Limitations and challenges include:
- Generalization: Many systems rely on 3DMM priors or mesh registrations (FLAME/SMPL-X), which limit the representation of non-canonical geometry such as loose clothing, complex hair, or accessories.
- Geometry/texture binding: Surface-based attachment can tether detail to mesh topology. Decoupled canonical/volumetric anchors (as in PiG-Avatar) address some, but not all, of these issues (Kaltheuner et al., 19 May 2026).
- Real-time, multi-actor scalability: Continued work is needed to enable full-scene, high-fidelity, simultaneous multi-avatar rendering at interactive rates on untethered/mobile hardware (Iandola et al., 2024, Song et al., 17 Mar 2026).
- Semantic controls and relighting: Most avatars encode radiance fields fixed to training illumination; robust relighting and physically-based appearance modeling remain open areas.
- Dataset bias and fairness: Generalization to diverse populations, ages, and conditions depends on the breadth of multi-view and synthetic training data.
Anticipated directions include integration of physics-based dynamics (for hair/clothes), richer statistical priors, global LOD mesh simplification and splat culling, and hybrid mesh–Gaussian architectures for compositional, semantically-driven animation.
7. Representative Pipelines and Comparative Table
The following table summarizes several key approaches and their principal design elements:
| Approach | Geometry Binding | Gaussian Prediction | Deformation Control | Scalability/Inference FPS |
|---|---|---|---|---|
| SEGA (Guo et al., 19 Apr 2025) | FLAME UV | Dual-branch CNN | Identity/expression latent | 20 FPS (A100), SOTA generalization |
| TexAvatars (Lee et al., 24 Dec 2025) | FLAME, Quasi-Phong Jacobians | CNN (UV), mesh Jacobian | Analytic rig + local CNN | 50 FPS (3090Ti) |
| CAG-Avatar (Chang et al., 21 Jan 2026) | FLAME UV | Cross-attention, Per-Gaussian | Per-splat cross-attention | >50 FPS (4090) |
| GIGA (Zubekhin et al., 8 Apr 2025) | SMPL-X (UV) | MultiHeadUNet | Motion code, pose | -- |
| NPGA (Giebenhain et al., 2024) | NPHM neural field | MLP (per-Gaussian) | Cycle-consistent MLPs | 31–43 FPS (3080) |
| ProgressiveAvatars (Song et al., 17 Mar 2026) | FLAME face-local | Implicit subdivision | Hierarchy, importance rank | 52–159 FPS (5090), progressive |
| SqueezeMe (Iandola et al., 2024) | LBS, UV | Linear, grid-shared | Linear corrective, GCS | 0.45 ms HTP, 72 FPS (Quest 3) |
| FiCA (Youwang et al., 23 Jun 2026) | UPM hypernetwork | Feed-forward mesh/diff | Universal prior, single-image | 66 FPS (A100), full pipeline ~5 s |
Significant improvements from 2023 to 2026 include (1) UV-space and hybrid mesh+CNN architectures, (2) real-time progressive loading, (3) substantial reductions in resource demands through linear distillation and shared correctives, and (4) state-of-the-art detail and generalization in single-image and few-shot settings.
The field continues rapid innovation in geometry/UV representation, learning paradigms, and real-time adaptation, converging toward robust, visually faithful, universally drivable Gaussian avatars.