
Gaussian Pixel Codec Avatars (GPiCA)

Updated 18 December 2025
  • Gaussian Pixel Codec Avatars are a hybrid representation that combines textured triangle meshes for flat surfaces and anisotropic 3D Gaussian splats for volumetric details.
  • The framework enables efficient, photorealistic rendering on mobile platforms by unifying mesh and volumetric data in a differentiable, optimized pipeline.
  • GPiCA achieves high-fidelity rendering for challenging regions like hair and beard, outperforming previous methods in quality metrics such as MAE, PSNR, SSIM, and LPIPS.

Gaussian Pixel Codec Avatars (GPiCA) are a class of photorealistic, animatable head avatars designed for real-time rendering with high visual fidelity and computational efficiency. The GPiCA framework achieves this by combining a triangle mesh and anisotropic 3D Gaussian splats within a unified, differentiable rendering pipeline, enabling efficient representation of both surface and volumetric structures. The hybrid approach allows high-quality avatar synthesis on resource-constrained devices such as mobile GPUs while delivering volumetric detail—especially for challenging regions like hair and beard—that is unattainable with mesh-only or Gaussian-only representations (Gupta et al., 17 Dec 2025).

1. Hybrid Scene Representation

GPiCA leverages a hybrid composition of two fundamental primitives:

  • Textured Triangle Mesh: Used for efficiently representing the "mostly flat" surface regions (skin, lips, eyelids). The mesh backbone is based on the PiCA neural mesh parameterization [Ma et al. CVPR '21]: a UV-space fully convolutional decoder ($\mathcal D_m$) produces per-vertex positions $X \in \mathbb{R}^{K^2 \times 3}$, RGB texel colors $T_c \in \mathbb{R}^{K \times K \times 3}$, and an opacity map $T_\alpha \in \mathbb{R}^{K \times K}$.
  • Sparse Anisotropic 3D Gaussians ("splats"): Deployed in regions where volumetric or thin structures (hair, beard, eyelashes) are required. Each Gaussian is parameterized by a center $\mathbf t_k$, rotation $\mathbf R_k \in SO(3)$, scale $\mathbf s_k \in \mathbb{R}^3$ (lengths of principal axes), learned opacity $o_k \in [0, 1]$, and view-dependent color $\mathbf c_k$.

All parameters are predicted by convolutional decoders operating in UV space, leveraging a shared latent code $\mathbf z$ for a unified avatar identity and expression representation. This design drastically reduces the total number of Gaussians: GPiCA typically uses $\sim 16{,}384$ splats (with 75% targeted to hair regions, guided by semantic UV segmentation), as opposed to the tens or hundreds of thousands in purely Gaussian avatars (Gupta et al., 17 Dec 2025).
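To make the parameterization concrete, the following is a minimal NumPy sketch of the parameter layout implied by the two primitives above. The UV resolution `K`, the placeholder values, and all variable names are hypothetical illustrations, not the authors' code.

```python
import numpy as np

K = 256            # hypothetical UV resolution; the paper's value is not stated here
N_SPLATS = 16_384  # splat budget reported for GPiCA

# Mesh-branch outputs of the UV-space decoder D_m(z), as placeholder
# arrays with the shapes given above.
X = np.zeros((K * K, 3))    # per-vertex positions, X in R^{K^2 x 3}
T_c = np.zeros((K, K, 3))   # RGB texel colors
T_alpha = np.zeros((K, K))  # opacity map

# Gaussian-branch parameters for each of the N_SPLATS splats (decoder D_g(z)).
t = np.zeros((N_SPLATS, 3))               # centers t_k
R = np.tile(np.eye(3), (N_SPLATS, 1, 1))  # rotations R_k in SO(3)
s = np.ones((N_SPLATS, 3))                # scales s_k (principal-axis lengths)
o = np.full(N_SPLATS, 0.5)                # opacities o_k in [0, 1]
c = np.zeros((N_SPLATS, 3))               # colors c_k (view-dependent in the full model)
```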

2. Unified Differentiable Rendering Pipeline

Rendering in GPiCA follows a two-pass compositing scheme:

  1. Mesh Pass: The neural mesh and its RGBA texture are rasterized via a fully differentiable GPU mesh renderer, producing per-pixel color ($C'_p$), opacity ($\alpha'_p$), and depth ($d'_p$).
  2. 3D Gaussian Splatting Pass: All Gaussians are sorted by camera-space depth. For each pixel $p$, with $m$ denoting the index of the first Gaussian at or behind the mesh depth $d'_p$, three compositing segments are accumulated:

  • Front Gaussians ($d_k < d'_p$):

$$C_{\mathrm{front}} = \sum_{k=1}^{m-1} \mathbf{c}_k\,\alpha_k \prod_{j<k} (1 - \alpha_j)$$

  • Mesh Contribution:

$$C_{\mathrm{mesh}} = C'_p\,\alpha'_p \prod_{j<m} (1 - \alpha_j)$$

  • Behind Gaussians ($d_k \geq d'_p$):

$$C_{\mathrm{behind}} = (1 - \alpha'_p) \sum_{k=m}^{N} \mathbf{c}_k\,\alpha_k \prod_{j<k} (1 - \alpha_j)$$

The final pixel color is:

$$C_p = C_{\mathrm{front}} + C_{\mathrm{mesh}} + C_{\mathrm{behind}}$$

This procedure treats the mesh as a semi-transparent volumetric layer within a standard Gaussian splatting framework, with all steps (mesh rasterization, Gaussian-to-screen projection, alpha blending) being differentiable. This design enables end-to-end optimization from input images through the latent code to the rendered output (Gupta et al., 17 Dec 2025).
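As an illustration of this compositing rule, here is a minimal NumPy sketch of the per-pixel logic, treating the mesh as a single semi-transparent layer inserted at its depth among the sorted Gaussians. The function signature and variable names are hypothetical; the actual renderer performs this on the GPU via differentiable rasterization and splatting.

```python
import numpy as np

def composite_pixel(mesh_color, mesh_alpha, mesh_depth,
                    gauss_colors, gauss_alphas, gauss_depths):
    """Front-to-back hybrid compositing for one pixel, following the
    three-segment scheme above.

    mesh_color:  (3,) rasterized mesh color C'_p
    mesh_alpha:  scalar mesh opacity alpha'_p
    mesh_depth:  scalar mesh depth d'_p
    gauss_*:     per-pixel Gaussian colors (N, 3), alphas (N,), depths (N,)
    """
    # Sort Gaussians by camera-space depth, as in the splatting pass.
    order = np.argsort(gauss_depths)
    gauss_colors = gauss_colors[order]
    gauss_alphas = gauss_alphas[order]
    gauss_depths = gauss_depths[order]

    color = np.zeros(3)
    transmittance = 1.0  # running prod_{j<k} (1 - alpha_j)
    mesh_done = False

    for c_k, a_k, d_k in zip(gauss_colors, gauss_alphas, gauss_depths):
        # Insert the mesh as one semi-transparent layer at depth d'_p.
        if not mesh_done and d_k >= mesh_depth:
            color += mesh_color * mesh_alpha * transmittance  # C_mesh term
            transmittance *= (1.0 - mesh_alpha)
            mesh_done = True
        color += c_k * a_k * transmittance  # C_front or C_behind term
        transmittance *= (1.0 - a_k)

    if not mesh_done:  # mesh lies behind all Gaussians
        color += mesh_color * mesh_alpha * transmittance
    return color
```

Folding $(1 - \alpha'_p)$ into the running transmittance once the mesh layer is composited reproduces the prefactor of $C_{\mathrm{behind}}$ in the equation above.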

3. Neural Networks and Supervision

A VAE-style architecture underpins the latent expression and geometry code ($\mathbf z$), with an encoder ingesting tracked meshes and canonicalized average UV textures from multi-view capture. Two UV-space decoders, one for the mesh ($\mathcal D_m$) and one for the Gaussians ($\mathcal D_g$), output the respective geometry and appearance parameters for the hybrid representation. Input is a tracked PiCA mesh plus textures; output is the set of parameters that define the final animatable avatar (Gupta et al., 17 Dec 2025).
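The encode/decode flow can be summarized in a short sketch; the function names and the reparameterization details here are assumptions based on the standard VAE pattern, not the paper's released code.

```python
import numpy as np

def gpica_forward(tracked_mesh, avg_uv_texture, encoder, D_m, D_g, rng=np.random):
    """Hypothetical GPiCA forward pass: encode capture data to a latent z,
    then decode z into mesh and Gaussian parameters (notation as above)."""
    mu, logvar = encoder(tracked_mesh, avg_uv_texture)             # VAE posterior
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterize
    X, T_c, T_alpha = D_m(z)  # vertex positions, RGB texture, opacity map
    gaussians = D_g(z)        # centers, rotations, scales, opacities, colors
    return (X, T_c, T_alpha), gaussians
```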

Supervision is provided by a photometric L2 loss between rendered and ground-truth pixels over all camera views:

$$\mathcal L_{\mathrm{rgb}} = \sum_p \|C_p - C_p^{\mathrm{GT}}\|^2$$

Additional constraints include KL-regularization of the latent code, Laplacian and normal smoothness terms for the mesh, and explicit scale and positional regularizers for the Gaussians:

$$\mathcal L_s = \operatorname{mean}_k\,\ell_s(s_k), \qquad \mathcal L_t = \operatorname{mean}_k\,\ell_t(\|\mathbf t_k\|)$$

where

$$\ell_s(s) = \begin{cases} \dfrac{1}{\max(s,\,10^{-7})}, & s < 0.1 \\ (s-10)^2, & s > 10 \\ 0, & \text{otherwise} \end{cases}$$

and $\ell_t(r) = \max(0,\, r-10)$ (Gupta et al., 17 Dec 2025).
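The two Gaussian regularizers translate directly into code. Below is a small NumPy sketch (threshold units exactly as written in the formulas above; function names are hypothetical).

```python
import numpy as np

def scale_penalty(s, eps=1e-7):
    """Per-axis scale regularizer ell_s: penalize splats that collapse
    below 0.1 or blow up past 10, and leave the rest unpenalized."""
    s = np.asarray(s, dtype=float)
    out = np.zeros_like(s)
    small = s < 0.1
    large = s > 10
    out[small] = 1.0 / np.maximum(s[small], eps)
    out[large] = (s[large] - 10.0) ** 2
    return out

def position_penalty(t):
    """ell_t: penalize Gaussian centers farther than 10 units from the origin."""
    r = np.linalg.norm(t, axis=-1)
    return np.maximum(0.0, r - 10.0)

# L_s and L_t are means over all splats, e.g. for scales (N, 3) and centers (N, 3):
# L_s = scale_penalty(scales).mean();  L_t = position_penalty(centers).mean()
```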

4. Performance Characteristics and Benchmarks

GPiCA demonstrates both quantitative and qualitative gains relative to previous mesh-only (PiCA) and Gaussian-only avatars:

  • Accuracy: On five subjects, hybrid GPiCA (16 K splats) outperforms mesh-only and 16 K-Gaussian avatars in MAE, PSNR, SSIM, and LPIPS, and matches or slightly improves upon the quality of 65 K-Gaussian models. Qualitative comparisons highlight the hybrid's capacity for crisp skin rendering (from meshes) and detailed, volumetric representation of hair and beard (from Gaussians) (Gupta et al., 17 Dec 2025).
  • Speed: On devices such as the Meta Quest 3 (Adreno 740), the GPiCA decoder (mesh plus Gaussians) requires 6.9 ms; the unified renderer delivers mesh rasterization in 1.63 ms and mesh+Gaussian hybrid splatting in 10.9 ms (versus 19.3 ms for 65 K Gaussians), with end-to-end rendering at $\sim 72$ Hz for $2048 \times 1334$ output resolution (Gupta et al., 17 Dec 2025).
  • Ablations: Semi-transparent mesh handling outperforms opaque mesh hybrids for integrating hair splats. Hair-prioritized splat initialization (UV masking) further improves appearance over uniform allocation.

5. Relation to Other Gaussian Codec Avatars

Earlier Gaussian Codec Avatar methods, such as the relightable variant (Saito et al., 2023), relied on a purely volumetric parameterization for both geometry and appearance, encoding a head with $M$ anisotropic 3D Gaussian splats. This approach excelled at modeling sub-millimeter detail and intricate volumetric structures (notably hair strands and pores) through learned per-Gaussian parameters. Appearance modeling used learnable radiance transfer, with global illumination-aware spherical harmonics for diffuse reflectance and spherical Gaussian lobes for all-frequency specular reflectance, enabling real-time relighting. The rendering pipeline was based on elliptically weighted average (EWA) splatting, with per-pixel color composited via volumetric accumulation (Saito et al., 2023).

However, large numbers of Gaussians (hundreds of thousands to millions) are necessary to achieve surface smoothness and high-fidelity rendering, limiting real-time applicability on resource-constrained hardware. By offloading most of the smooth surface regions to a mesh and restricting high-density Gaussians to volumetric details, GPiCA achieves comparable visual fidelity with significantly lower computational cost (Gupta et al., 17 Dec 2025).

6. Limitations and Research Directions

While GPiCA advances the state of the art in mobile photorealistic avatar synthesis, certain constraints remain:

  • The current framework is optimized for static head shapes modulated by expression latents; modeling speech-driven dynamics or extreme deformations would require either time-varying splats or per-frame mesh retopology.
  • Further reduction of Gaussian count—potentially via dynamic splat pruning or continual importance sampling—may lower memory requirements for even more constrained environments.
  • The hybrid structure is conducive to real-time editing, such as re-styling hair or live illumination changes, by selectively updating mesh versus volumetric components (Gupta et al., 17 Dec 2025).

A plausible implication is that this hybrid paradigm may generalize to other complex objects in scene reconstruction, where volumetric and surface features coexist and efficiency is paramount.

7. Summary Table: Comparison of Core Elements in GPiCA and Prior Methods

| Element | GPiCA Hybrid (Gupta et al., 17 Dec 2025) | Pure Gaussian Codec (Saito et al., 2023) |
|---|---|---|
| Surface regions | Neural triangle mesh + RGBA texture | Dense 3D Gaussian splats |
| Volumetric/hair | Sparse, anisotropic 3D Gaussians (~16 K) | Dense 3D Gaussians (65 K–1 M+) |
| Appearance | View-conditioned color (UV-space decoders) | SH/SG reflectance, per-Gaussian SH |
| Rendering | Two-pass: mesh rasterization + Gaussian splatting | EWA splatting, volumetric accumulation |
| Efficiency | ~72 Hz on mobile GPU (2048×1334) | 18 ms/frame for 1 M splats (A100 GPU) |

The emergence of Gaussian Pixel Codec Avatars marks a significant step toward efficient, scalable, and photorealistic neural head avatar synthesis, integrating the strengths of mesh and volumetric paradigms within a single, optimizable framework suitable for real-time applications (Gupta et al., 17 Dec 2025, Saito et al., 2023).
