Canonical Gaussian Avatars
- Canonical Gaussian avatars are digital human representations defined by explicit 3D Gaussian primitives anchored to standardized templates like SMPL-X.
- They employ Gaussian splatting and dual 2D map parameterization to enable real-time rendering, dynamic pose animation, and efficient garment decomposition.
- This framework supports high-fidelity relighting and robust performance, making it ideal for AR/VR, telepresence, virtual production, and related applications.
A canonical Gaussian avatar is a digital human representation constructed as an explicit set of 3D Gaussian primitives, parameterized and organized in a standardized "canonical" pose space. This approach leverages recent advances in Gaussian Splatting—where each avatar is encoded as a collection of spatially and photometrically parameterized Gaussians—enabling fast, animatable, and relightable digital humans that outperform implicit models in fidelity, efficiency, and editability. The canonical space typically aligns with a character-specific template (such as SMPL-X or FLAME mesh) and allows robust mapping between avatar geometry/appearance and animation parameters. The framework extends to garment decomposition, multi-modal rendering, and physical relighting, and is the foundation for contemporary research converting image or video observations into high-fidelity, controllable avatars.
1. Canonical Gaussian Representation
Canonical Gaussian avatars are defined in a reference pose and parameterized as a set of explicit 3D Gaussians, each described by its mean $\boldsymbol{\mu}_i$, covariance matrix $\Sigma_i$, color $\mathbf{c}_i$, and opacity $o_i$. The overall density field is modeled as

$$
D(\mathbf{x}) = \sum_i o_i \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^{\top} \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right),
$$
with rendering performed via ordered splatting and alpha compositing. In practice, avatars are anchored to a learned parametric template representing both body and clothing; SMPL-X or FLAME meshes are often used as priors for template extraction, ensuring anatomical coherence and acting as skeletons for downstream animation. The template is extracted (e.g., via marching cubes on an SDF learned from multi-view videos) and imbued with skinning weights to enable articulated deformation.
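A minimal NumPy sketch of these two ingredients follows: evaluating one opacity-weighted Gaussian and compositing depth-ordered splats front to back. The parameter names (`mu`, `cov`, `opacity`) are illustrative and not tied to any particular codebase.

```python
# Sketch of the canonical Gaussian density term and front-to-back alpha compositing.
import numpy as np

def gaussian_density(x, mu, cov, opacity):
    """Evaluate o_i * exp(-0.5 (x - mu)^T Sigma^-1 (x - mu)) for one Gaussian."""
    d = x - mu
    return opacity * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def composite_along_ray(colors, alphas):
    """Alpha-composite depth-ordered splats along a ray, front to back."""
    c_out, transmittance = np.zeros(3), 1.0
    for c, a in zip(colors, alphas):
        c_out += transmittance * a * c
        transmittance *= (1.0 - a)
    return c_out

# Toy example: two splats already projected and depth-ordered along one ray.
print(composite_along_ray([np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])],
                          [0.6, 0.8]))
```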
2. Template Parameterization and 2D Gaussian Maps
To facilitate efficient appearance modeling and leverage 2D convolutional architectures, canonical avatar geometry is "unwrapped" into one or more canonical Gaussian maps. Common practice projects the template into front and back 2D maps, assigning each pixel a corresponding 3D Gaussian parameter. This mapping guarantees dense coverage—even for regions occluded in a single view—and enables later use of StyleGAN-based CNNs for detailed, pose-conditioned regression. The dual-map parameterization also permits garment decomposition, with layers representing separate body parts or garments, as in LAGA (Gong et al., 21 May 2024) and FMGS-Avatar (Fan et al., 18 Sep 2025).
| Method | Template Basis | 2D Map Unwrapping | Layered/Decomposed |
|---|---|---|---|
| Animatable Gaussians (Li et al., 2023) | SMPL-X mesh | Dual (front/back) | No |
| LAGA (Gong et al., 21 May 2024) | SMPL-X mesh | Layered per garment | Yes |
| FMGS-Avatar (Fan et al., 18 Sep 2025) | Upsampled SMPL mesh | Per-face barycenter | Yes |
The explicit mapping to canonical 2D maps ensures that every animation or style edit can be performed as efficient 2D convolutions, with 3D positions regenerated by inverse projection.
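The sketch below illustrates the idea in simplified form, splitting canonical template points into front and back orthographic maps by the sign of their z-coordinate (an assumption for brevity; real systems unwrap by surface visibility) and recovering 3D Gaussian centers by inverse projection of the stored per-pixel positions.

```python
# Illustrative dual front/back Gaussian-map layout; not a specific paper's implementation.
import numpy as np

def unwrap_to_maps(points, res=64, extent=1.0):
    """Split canonical points into front/back maps; each pixel stores a 3D position."""
    maps = {side: np.zeros((res, res, 3)) for side in ("front", "back")}
    for p in points:
        side = "front" if p[2] >= 0 else "back"      # simplified facing test
        u = int((p[0] / extent * 0.5 + 0.5) * (res - 1))
        v = int((p[1] / extent * 0.5 + 0.5) * (res - 1))
        maps[side][v, u] = p
    return maps

def reproject(maps):
    """Inverse projection: every non-empty pixel yields back a 3D Gaussian center."""
    pts = [px for m in maps.values() for px in m.reshape(-1, 3) if np.any(px)]
    return np.stack(pts)

pts = np.random.uniform(-1, 1, size=(500, 3))
print(reproject(unwrap_to_maps(pts)).shape)
```

In practice the per-pixel channels would also hold color, opacity, scale, and rotation, so a 2D CNN can regress all Gaussian attributes jointly before they are lifted back to 3D.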
3. Animation via Skinning and Pose Projection
Animation of canonical Gaussian avatars is realized by deforming each Gaussian's position, orientation, and scale according to underlying skeletal pose parameters. Linear Blend Skinning (LBS) is the standard method: for each Gaussian, the deformation is

$$
\boldsymbol{\mu}_i' = \left(\sum_{k} w_{i,k}\, B_k\right) \boldsymbol{\mu}_i,
$$

where $w_{i,k}$ and $B_k$ are skinning weights and bone transforms (applied in homogeneous coordinates). To generalize to poses not seen during training, methods often employ pose projection via principal component analysis (PCA), projecting novel poses into the manifold of training examples and clipping coefficients to prevent out-of-distribution artifacts (Li et al., 2023). For non-linear attributes such as rotation, recent work has proposed quaternion averaging for robust rotation blending instead of linear averaging, ensuring proper deformation for view-dependent Gaussian parameters (Zioulis et al., 14 Sep 2025).
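A short sketch of both operations, with array shapes and names chosen for illustration: LBS blends 4x4 bone transforms by skinning weights, while rotations are blended by a weighted quaternion sum with sign alignment, a common approximation to quaternion averaging.

```python
# Sketch of LBS on a Gaussian center and weighted quaternion blending for its rotation.
import numpy as np

def lbs_transform(mu, weights, bone_transforms):
    """mu' = (sum_k w_k B_k) applied to the homogeneous canonical center mu."""
    T = np.einsum("k,kij->ij", weights, bone_transforms)   # blended 4x4 transform
    return (T @ np.append(mu, 1.0))[:3]

def blend_quaternions(weights, quats):
    """Align signs to the first quaternion, take the weighted sum, renormalize."""
    ref = quats[0]
    aligned = np.array([q if np.dot(q, ref) >= 0 else -q for q in quats])
    q = np.einsum("k,ki->i", weights, aligned)
    return q / np.linalg.norm(q)

# Toy example: one Gaussian skinned to two bones, the second translated upward.
bones = np.stack([np.eye(4), np.eye(4)])
bones[1, :3, 3] = [0.0, 0.1, 0.0]
w = np.array([0.7, 0.3])
print(lbs_transform(np.array([0.0, 1.0, 0.0]), w, bones))
print(blend_quaternions(w, np.array([[1.0, 0.0, 0.0, 0.0],
                                     [0.9239, 0.3827, 0.0, 0.0]])))
```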
4. Appearance Modeling and Physically-Based Relighting
Besides geometric deformation, canonical Gaussian avatars encode dynamic, view-dependent appearance. StyleGAN-based networks predict pose-conditioned Gaussian maps, modulating both color and geometric offsets; inputs include projected canonical templates, pose maps, and view direction encodings. Material properties such as albedo and roughness are additionally predicted for each Gaussian, allowing integration with physically-based rendering (PBR). The rendered radiance is computed as

$$
L_o(\mathbf{x}, \boldsymbol{\omega}_o) = \int_{\Omega} f_r(\mathbf{x}, \boldsymbol{\omega}_i, \boldsymbol{\omega}_o)\, L_i(\mathbf{x}, \boldsymbol{\omega}_i)\, (\mathbf{n} \cdot \boldsymbol{\omega}_i)\, \mathrm{d}\boldsymbol{\omega}_i,
$$
or, for efficiency, as discrete summations over learned environment light probes (Zhan et al., 15 Jul 2024). Gaussian avatars thus support realistic relighting under arbitrary illumination, with ancillary outputs such as albedo and normal maps enabling intrinsic image decomposition.
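The following sketch shows the discrete-probe approximation for a single Gaussian with a Lambertian BRDF only; the probe directions, radiances, and solid angles are placeholder values, not learned quantities from any specific method.

```python
# Discrete rendering-equation summation over environment light probes (Lambertian term).
import numpy as np

def shade_gaussian(albedo, normal, probe_dirs, probe_radiance, solid_angles):
    """L_o ~= sum_j (albedo / pi) * L_j * max(n . w_j, 0) * dOmega_j."""
    cosines = np.clip(probe_dirs @ normal, 0.0, None)            # (J,)
    return (albedo / np.pi) * (probe_radiance.T @ (cosines * solid_angles))

dirs = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])  # 3 toy probes
rad = np.array([[1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [0.2, 0.2, 0.2]])   # probe RGB radiance
print(shade_gaussian(np.array([0.8, 0.6, 0.5]), np.array([0.0, 0.0, 1.0]),
                     dirs, rad, np.full(3, 4.0 * np.pi / 3.0)))
```

A full PBR pipeline would add a specular lobe driven by the predicted roughness and visibility terms; the Lambertian-only form is kept here for brevity.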
5. Extensions: Garment Decomposition, Multi-Human Scenes, and Semantic Enrichment
The framework supports explicit layering and garment transfer by modeling each garment as an independent Gaussian set, governed by regularization losses ensuring visibility, fitting, and structural similarity (Gong et al., 21 May 2024). Mesh-guided approaches tie each Gaussian to a triangle face and restrict its movement to the face normal direction, ensuring geometric consistency and enabling semantic annotation with foundation models (Fan et al., 18 Sep 2025). Methods scale efficiently to multi-human scenes by duplicating the pipeline per avatar and performing incremental aggregation with transformer-based models (Wu et al., 27 Aug 2025), facilitating flexible data input and improving reconstruction quality as more observations arrive.
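The mesh-guided constraint can be illustrated as below: a Gaussian center is expressed as a barycentric point on its anchor triangle plus a scalar offset along the face normal. This is a simplified reading of the constraint; the function and variable names are assumptions for illustration.

```python
# Sketch of mesh-guided anchoring: barycentric point on a face plus a normal offset.
import numpy as np

def anchored_center(tri_verts, bary, normal_offset):
    """Gaussian center = barycentric point on the triangle + offset along its normal."""
    base = bary @ tri_verts                                   # (3,) point on the face
    n = np.cross(tri_verts[1] - tri_verts[0], tri_verts[2] - tri_verts[0])
    n /= np.linalg.norm(n)
    return base + normal_offset * n

tri = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(anchored_center(tri, np.array([1/3, 1/3, 1/3]), 0.02))
```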
6. Computational Efficiency and Comparative Performance
Canonical Gaussian avatars achieve fast reconstruction and real-time rendering compared to implicit radiance field approaches. Experimental results across the THuman4.0, AvatarReX, ActorsHQ, VFHQ, and HDTF datasets report high PSNR and SSIM together with low LPIPS and FID values, typically outperforming NeRF-based and earlier splatting methods in both static image quality and dynamic animation (Li et al., 2023, He et al., 25 Feb 2025, Yan et al., 4 Mar 2025). Feed-forward transformer-based models can reconstruct avatars within seconds from single images or monocular video (Wu et al., 27 Aug 2025, He et al., 25 Feb 2025), with rendering speeds exceeding hundreds of FPS on modern GPUs.
| Method | Training Time | Rendering FPS | Datasets | Benchmarked Metrics |
|---|---|---|---|---|
| Animatable Gaussians (Li et al., 2023) | ~1 hour | Real-time (up to 170) | THuman4.0, AvatarReX | PSNR, SSIM, LPIPS, FID |
| FastAvatar (Wu et al., 27 Aug 2025) | <10 seconds | 100s+ | VFHQ, HDTF | PSNR, SSIM, LPIPS |
| GUAVA (Zhang et al., 6 May 2025) | <0.1 seconds | >50 | Custom/real-time | PSNR, SSIM, LPIPS |
7. Applications, Limitations, and Future Directions
Canonical Gaussian avatars support applications in AR/VR, telepresence, gaming, virtual production, and metaverse environments. Their explicit structure enables garment editing, real-time teleconferencing, and efficient semantic segmentation. Challenges remain in modeling highly loose or nonrigid garments, handling extreme pose extrapolation, and maximizing appearance fidelity from limited monocular data. Suggested future directions include end-to-end feed-forward architectures that further reduce optimization time, as well as the integration of advanced cloth dynamics and generative priors to improve realism and applicability.
Canonical Gaussian avatars represent a convergence of explicit geometric modeling, fast animation, and physical relighting within a unified canonical coordinate system. Their modularity and generalization capabilities make them the current preferred technology for efficient, high-fidelity digital human reconstruction and animation.