3D Avatar Gaussian Splatting

Updated 20 April 2026

The paper introduces a network-free canonical avatar constructed from anisotropic 3D Gaussians for compact, photorealistic rendering.
It employs SMPL-based pose encoding and linear blend skinning to efficiently animate avatars while decoupling fixed appearance from motion.
The framework achieves real-time rendering with exceptional rate-distortion efficiency, enabling immersive streaming with low bitrate.

3D Avatar Gaussian Splatting is a computational paradigm for representing, rendering, animating, and compressing photorealistic human avatars based on explicit 3D Gaussian primitives. Extending real-time Gaussian Splatting from static or scene-based radiance field rendering to dynamic, animatable human body and face avatars, these frameworks exploit geometric priors (e.g., SMPL) and disentangle appearance from motion. By integrating explicit statistical shape models with explicit volumetric splatting, these methods provide real-time, high-fidelity novel-view/image synthesis, pose-driven animation, and exceptional rate-distortion efficiency for interactive and streaming applications.

1. Canonical Gaussian Avatar Construction

In the leading frameworks, the foundation is a canonical, “network-free” avatar constructed as a set of $N$ anisotropic 3D Gaussian primitives:

$G = \{g_1, g_2, \dotsc, g_N\}$

where each Gaussian $g_i$ is parameterized by its mean position $\mu_i \in \mathbb{R}^3$, covariance $\Sigma_i \in \mathbb{R}^{3\times3}$ (encoding anisotropic size and orientation), spherical-harmonic color coefficients $c_i$ (typically $C=9$ for low-frequency lighting), and opacity $\alpha_i \in [0,1]$ (Yin et al., 12 Oct 2025). For animatable body avatars, the Gaussians are initialized at the vertex positions of an SMPL mesh in the outstretched star pose. Joint optimization over all views and frames in the training set then aligns the Gaussians to the multi-view imagery by minimizing the loss:

$L = \sum_{t,v} \Bigl\| \hat{I}_t^v - I_t^v \Bigr\|_1 + \lambda_1 \Bigl\| A_t^v - m_t^v \Bigr\|_2 + \lambda_2 (1-\mathrm{SSIM}(\hat{I}_t^v, I_t^v)) + \lambda_3 \mathrm{LPIPS}(\hat{I}_t^v, I_t^v)$

where $\hat{I}_t^v$ and $A_t^v$ are the rendered color and opacity for frame $t$ and view $v$, $I_t^v$ and $m_t^v$ are the ground-truth image and binary mask, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are balancing weights (Yin et al., 12 Oct 2025). Rendering is via forward splatting and alpha compositing. This canonical representation serves as a stable, highly compact appearance model for subsequent deformation.
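The loss above can be sketched in a few lines of NumPy. This is only an illustration of how the four terms combine, not the authors' training code: `ssim_fn` and `lpips_fn` are hypothetical callables standing in for whatever SSIM and LPIPS implementations are used, and the weights are left as arguments because their values are not fixed here.

```python
import numpy as np

def frame_view_loss(rendered, target, alpha, mask,
                    ssim_fn, lpips_fn, lam1, lam2, lam3):
    """One (t, v) term of the loss: L1 photometric + mask L2 + SSIM + LPIPS.

    ssim_fn and lpips_fn are assumed callables (image, image) -> float.
    """
    l1 = np.abs(rendered - target).sum()        # || I_hat - I ||_1
    mask_l2 = np.linalg.norm(alpha - mask)      # || A - m ||_2 over all pixels
    ssim_term = 1.0 - ssim_fn(rendered, target)
    lpips_term = lpips_fn(rendered, target)
    return l1 + lam1 * mask_l2 + lam2 * ssim_term + lam3 * lpips_term

def total_loss(samples, ssim_fn, lpips_fn, lam1, lam2, lam3):
    """Sum the per-sample terms over all (rendered, target, alpha, mask) tuples."""
    return sum(
        frame_view_loss(r, i, a, m, ssim_fn, lpips_fn, lam1, lam2, lam3)
        for (r, i, a, m) in samples
    )
```

In an actual optimizer the same expression would be written in an autodiff framework so that gradients flow back to the Gaussian parameters; NumPy is used here only to make the arithmetic explicit.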

2. Avatar Animation via Geometric Priors and Skinning

Animatable avatars leverage a geometric prior such as the SMPL statistical body model, which maps shape ($\beta$) and pose ($\theta$) parameters to mesh vertices and skeleton joints. For each animation frame $t$, only 94 parameters are needed (a 72D axis-angle joint rotation, a 10D shape vector, a 3×3 global rotation, and a 3D translation):

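$\{\theta_t, \beta_t, R_t, T_t\}, \quad \theta_t \in \mathbb{R}^{72},\ \beta_t \in \mathbb{R}^{10},\ R_t \in \mathbb{R}^{3\times3},\ T_t \in \mathbb{R}^{3}$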
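The 94-value count follows from 72 + 10 + 9 + 3. A minimal sketch of packing one frame's parameters into a flat vector is shown below; the function names and the storage order (pose, shape, rotation, translation) are assumptions made for illustration, not a layout taken from the cited work.

```python
import numpy as np

def pack_frame_params(theta, beta, R, T):
    """Flatten one frame's SMPL-style code into a 94-element float32 vector.

    Assumed order: 72 pose values, 10 shape values, 9 rotation entries
    (row-major 3x3), 3 translation values.
    """
    parts = [
        np.asarray(theta, dtype=np.float32).reshape(72),
        np.asarray(beta, dtype=np.float32).reshape(10),
        np.asarray(R, dtype=np.float32).reshape(9),
        np.asarray(T, dtype=np.float32).reshape(3),
    ]
    return np.concatenate(parts)

def unpack_frame_params(code):
    """Inverse of pack_frame_params: returns (theta, beta, R, T)."""
    code = np.asarray(code)
    if code.shape != (94,):
        raise ValueError("expected a 94-element parameter vector")
    return code[:72], code[72:82], code[82:91].reshape(3, 3), code[91:]
```

A vector of this kind is what would be quantized and entropy coded for each frame.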

This compact pose/shape code is entropy encoded per frame, decoupling temporal motion from fixed appearance and minimizing redundancy (Yin et al., 12 Oct 2025). At decode time, each Gaussian center and covariance undergoes a canonical-to-target deformation via Linear Blend Skinning (LBS):

\begin{align*}
A_i &= \sum_k w_{i,k} A_k \\
b_i &= \sum_k w_{i,k} b_k \\
\mu_i^t &= A_i \mu_i + b_i \\
\bar{\mu}_i^t &= \mu_i^t R_t^\top + T_t \\
\Sigma_i^t &= A_i \Sigma_i A_i^\top
\end{align*}

where $w_{i,k}$ are the per-Gaussian skinning weights, $(A_k, b_k)$ is the transform of joint $k$, and $(R_t, T_t)$ is the global pose. The avatar's geometry can thus be animated coherently across time and viewpoints by deforming the canonical Gaussians via LBS, while its appearance (color and opacity) stays fixed.
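The LBS equations above translate almost directly into vectorized NumPy. The sketch below assumes the joint transforms are already given as a rotation-like matrix and offset per joint; the function name and array shapes are illustrative choices. It follows the equations as written, so the global rotation is applied to the centers only.

```python
import numpy as np

def lbs_deform(mu, Sigma, weights, A_joints, b_joints, R_t, T_t):
    """Deform canonical Gaussians to frame t with Linear Blend Skinning.

    mu:       (N, 3)    canonical centers
    Sigma:    (N, 3, 3) canonical covariances
    weights:  (N, K)    skinning weights w_{i,k}
    A_joints: (K, 3, 3) per-joint linear transforms A_k
    b_joints: (K, 3)    per-joint offsets b_k
    R_t, T_t: (3, 3), (3,) global pose
    """
    # Blend the joint transforms per Gaussian: A_i = sum_k w_ik A_k, b_i = sum_k w_ik b_k
    A = np.einsum("nk,kij->nij", weights, A_joints)
    b = weights @ b_joints
    # mu_i^t = A_i mu_i + b_i
    mu_t = np.einsum("nij,nj->ni", A, mu) + b
    # Global pose with row-vector convention: mu_bar = mu_t R_t^T + T_t
    mu_bar = mu_t @ R_t.T + T_t
    # Sigma_i^t = A_i Sigma_i A_i^T
    Sigma_t = A @ Sigma @ np.transpose(A, (0, 2, 1))
    return mu_bar, Sigma_t
```

Because color and opacity are not touched, only `mu_bar` and `Sigma_t` need to be recomputed per frame.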

3. Rendering Pipeline and Loss Formulation

Rendering proceeds by projecting each anisotropic Gaussian onto the image plane. After deformation, each primitive’s ellipsoidal footprint is rasterized using forward splatting; colors are modulated, possibly by spherical harmonics for view dependence; and contributions are composited along the camera ray in front-to-back order with alpha blending (Yin et al., 12 Oct 2025). The compositing formula is

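$C = \sum_{i=1}^{N} c_i \, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$

where the Gaussians are sorted front to back, $c_i$ is the color of the $i$-th Gaussian, and $\alpha_i$ is its opacity weighted by its projected screen-space density at the pixel, so that each Gaussian's density and color determine its pixel contribution. Direct splatting avoids the need for costly ray marching. Multi-view photometric, structural (SSIM), and perceptual (LPIPS) losses are jointly optimized for both canonical construction and fine-tuning. The pipeline allows for extremely efficient, GPU-accelerated, parallelized rendering and training (Jung et al., 2023).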
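Front-to-back alpha compositing for a single pixel can be sketched as follows. The sketch assumes the per-Gaussian colors and effective opacities at that pixel have already been computed and depth-sorted; the function name is illustrative, and a real rasterizer would run this per tile on the GPU rather than in NumPy.

```python
import numpy as np

def composite_front_to_back(colors, alphas):
    """Blend depth-sorted contributions for one pixel.

    colors: (M, 3) colors, nearest Gaussian first
    alphas: (M,)   effective opacities in [0, 1] at this pixel
    Returns the blended RGB color and the accumulated opacity.
    """
    colors = np.asarray(colors, dtype=float).reshape(-1, 3)
    alphas = np.asarray(alphas, dtype=float).reshape(-1)
    # Transmittance before each Gaussian: prod_{j<i} (1 - alpha_j)
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))[:-1]
    weights = alphas * transmittance
    color = weights @ colors
    opacity = 1.0 - np.prod(1.0 - alphas)
    return color, opacity
```

For example, a half-transparent red Gaussian in front of a half-transparent green one gives weights 0.5 and 0.25, so the front primitive dominates the pixel.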

4. Compression and Rate-Distortion Analysis

The prior-guided framework exploits the fact that the canonical avatar is frame-invariant and only needs to be compressed once (quantized and entropy coded via, for example, MPEG GeS-TM). The temporal parameters require minimal bandwidth: 94 floats per frame, quantized and CABAC-coded. The per-frame runtime bit-rate therefore covers only this coded pose/shape stream, while the total rate also includes the one-time cost of the canonical avatar (Yin et al., 12 Oct 2025). Explicit rate–distortion optimization allows control of reconstruction quality as a function of bit-rate.

On ZJU-MoCap and MonoCap, prior-guided 3DGS achieves ultra-low bitrates (≤0.2 Mbps/0.26 Mbps), with PSNR ≈ 33–34 dB and SSIM > 0.95—far surpassing mesh-based, point-cloud, or learned codecs (e.g., CompactSTG) by both objective and subjective criteria (Yin et al., 12 Oct 2025). Subjective evaluation confirms preservation of fine details, facial features, and limb silhouettes even at extreme rate savings.
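A back-of-the-envelope calculation shows why the motion stream is so cheap. The sketch below computes the uncoded bitrate of 94 parameters per frame; the 32-bit precision and 30 fps frame rate are assumptions chosen for illustration, and the result is an upper bound before any quantization or entropy coding, not a figure from the cited work.

```python
def raw_motion_bitrate_mbps(params_per_frame=94, bits_per_param=32, fps=30.0):
    """Uncoded bitrate of the per-frame parameter stream, in Mbps.

    Defaults (32-bit values, 30 fps) are illustrative assumptions.
    """
    bits_per_second = params_per_frame * bits_per_param * fps
    return bits_per_second / 1e6

# 94 * 32 * 30 = 90,240 bits per second
print(raw_motion_bitrate_mbps())                   # 0.09024
print(raw_motion_bitrate_mbps(bits_per_param=16))  # 0.04512
```

Even without compression, the motion stream stays well under 1 Mbps under these assumptions, which is why the one-time cost of the canonical avatar matters most for short sequences.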

5. Extensions and Architectural Variants

While the prior-guided approach is network-free and extremely efficient for streaming or VR applications, related work augments or alters this template for further expressivity or application domains:

Dynamic non-rigid refinement: Approaches such as ParDy-Human interpose a per-Gaussian learned residual MLP, enabling local, pose-dependent deformations beyond pure LBS, effectively capturing fine nonrigid effects (e.g., cloth wrinkles) and improving novel-pose generalization (Jung et al., 2023).
Integration with crowd and interactive systems: CrowdSplat incorporates the same 3DGS/LBS pipeline with multi-level LoD selection, SoA GPU structures, and shared attribute buffers to enable memory- and bandwidth-efficient rendering of thousands of animated avatars in real time (Sun et al., 29 Jan 2025).
Fine-grained detail and head avatars: Specialized methods (e.g., HyperGaussians (Serifi et al., 3 Jul 2025)) and mouth-adaptive splatting (GeoAvatar (Moon et al., 24 Jul 2025)) further refine details for expressive faces, augmenting the 3D Gaussian representation with high-dimensional latent codes or mouth-specific deformation submodules.

6. Impact, Limitations, and Application Domains

Prior-guided, 3DGS-based avatar splatting achieves a unique combination of ultra-low bandwidth, real-time performance, state-of-the-art perceptual visual quality for human bodies and faces, and stability for novel-pose and novel-view applications (Yin et al., 12 Oct 2025, Sun et al., 29 Jan 2025). These properties directly enable real-time immersive streaming for metaverse, VR/AR, and social-telepresence applications, as well as efficient archiving and transmission of multi-view human video content.

However, the explicit separation of geometry and appearance limits expressivity for highly non-rigid deformations, where residual refinement or generative priors may be required. Moreover, practical rendering and optimization efficiency depends on tight hardware/codec integration.

7. Comparative Table: Prior-Guided 3D Avatar Splatting

Feature	Prior-Guided 3DGS (Yin et al., 12 Oct 2025)	Dynamic Refinement (ParDy-Human (Jung et al., 2023))	Crowd Splatting (Sun et al., 29 Jan 2025)
Canonical Avatar	Yes, shared for all frames	Yes, with per-Gaussian residuals	Yes
Animation	SMPL + LBS (94 params/frame)	SMPL + residual MLP	SMPL + LBS
Compression	Single G + per-frame SMPL code	-	-
Rate-Distortion	0.2–0.26 Mbps, PSNR 33–34 dB	-	LPIPS ~0.12–0.19, PSNR ~27–31 dB
Supported Output	Full-body, limbs, face detail	Dynamic cloth, novel pose	Crowds (>3k avatars in real time)
Rendering Speed	Real-time (tens of ms)	~9 fps (CPU, full res)	31 fps (3.5k avatars @ 1280x720, RTX4090)

These platforms define the state of the art for real-time, high-fidelity, scalable, and bandwidth-efficient digital human representation using 3D Avatar Gaussian Splatting.