
CAG-Avatar: Adaptive Gaussian Avatars

Updated 28 January 2026
  • CAG-Avatar is a real-time digital avatar model that adapts 3D Gaussian primitives using both global and local driving signals.
  • The approach employs cross-attention, tensorial representations, and patch-based conditioning to achieve fine-grained animation details.
  • Efficient rendering and compact representations enable scalable use in interactive systems, VR, and telepresence applications.

Conditionally-Adaptive Gaussian Avatars (CAG-Avatar) are a class of real-time, animatable digital avatar representations based on 3D Gaussian splatting, characterized by per-region or per-primitive adaptation of Gaussian parameters in response to global and/or local driving signals (e.g., blendshape codes, local expressions, or pose priors). This approach generalizes standard 3D Gaussian Avatar methods by enabling granular, spatially varying responses to facial dynamics, leading to higher fidelity and more robust head or body animation while maintaining the efficiency and explicitness that distinguishes Gaussian representations.

1. Gaussian Splatting and Parametrization

CAG-Avatar frameworks fundamentally employ a set of 3D Gaussian primitives to represent head or body geometry as well as appearance attributes. Each Gaussian $G_i$ is defined by position $\mu_i \in \mathbb{R}^3$, anisotropic or isotropic scale $s_i$ (or full covariance $\Sigma_i \in \mathbb{R}^{3\times 3}$), orientation $R_i \in SO(3)$ (often parameterized by a quaternion), base opacity $\alpha_{c,i}$ (or learned $\alpha_i$), and feature vectors for view-dependent appearance (e.g., spherical harmonics, tri-plane features, or learned codes). Rasterization is performed via explicit depth-sorted accumulation of each Gaussian's projected contribution onto the 2D image plane, allowing for high-framerate, real-time rendering (Wang et al., 21 Apr 2025, Chang et al., 21 Jan 2026).
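The per-primitive parametrization and the depth-sorted accumulation can be sketched in a few lines. This is a minimal illustrative sketch, not code from any of the cited papers; the field names and the scalar-color compositing are simplifying assumptions.

```python
from dataclasses import dataclass

@dataclass
class Gaussian:
    """Hypothetical minimal splatting primitive (names illustrative)."""
    mu: tuple      # position mu_i in R^3
    scale: tuple   # anisotropic scale s_i
    quat: tuple    # orientation R_i as a unit quaternion (w, x, y, z)
    alpha: float   # base opacity alpha_i
    feat: tuple    # view-dependent appearance features (e.g. SH coefficients)

def composite(contributions):
    """Front-to-back alpha compositing of depth-sorted per-pixel
    contributions, each a (projected opacity, color) pair."""
    color, transmittance = 0.0, 1.0
    for alpha, c in contributions:
        color += transmittance * alpha * c
        transmittance *= 1.0 - alpha
    return color
```

The explicit front-to-back loop is what makes the representation fast: each pixel only accumulates the sorted Gaussians that project onto it, with no ray marching.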

Spatial and attribute adaptivity is achieved by conditionally modulating each Gaussian's attributes, either through cross-attention–based fusion of global expression codes (as in CAG-Avatar (Chang et al., 21 Jan 2026)), per-patch or per-line local expression codes (as in ScaffoldAvatar (Aneja et al., 14 Jul 2025) and tri-plane/feature-line methods (Wang et al., 21 Apr 2025)), or via conditioned neural decoders (e.g., U-Nets in PGHM (Peng et al., 7 Jun 2025)).

2. Adaptive Conditioning and Fusion Strategies

CAG-Avatar methods systematically depart from global, one-size-fits-all driving codes by introducing conditioning modules that allow each Gaussian, patch, or anchor point to separately query or receive regionally relevant expression or pose information.

  • Cross-Attention Fusion: Each Gaussian's canonical position is used as a query into the global expression code, producing a location-specific context vector via softmax attention. This context is concatenated with positional features and decoded to predict per-Gaussian offsets to position, orientation, and scale, enabling fine-grained, localized dynamics particularly critical for distinguishing deformable areas (e.g., skin) from rigid structures (e.g., teeth, jaw) (Chang et al., 21 Jan 2026).
  • Tensorial and Tri-Plane Representations: Static appearance is encoded in compact tri-plane feature volumes, while dynamic (expression-dependent) details are modeled by lightweight 1D feature lines indexed along canonical axes. The resulting features are combined and decoded to predict local appearance/opacity offsets, with expression mixing performed via learned or data-driven blending of feature lines (Wang et al., 21 Apr 2025).
  • Patch- and Anchor-Based Conditioning: ScaffoldAvatar employs a geometric patch decomposition, fitting per-patch blendshape coefficients and assigning anchor points within each patch. Per-patch, per-anchor, and global codes are fed into lightweight MLPs that output all Gaussian attributes per region, targeting microfeatures and dynamic skin motion with high granularity (Aneja et al., 14 Jul 2025).
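The cross-attention fusion in the first bullet can be sketched as a single-query scaled dot-product attention: the Gaussian's positional query attends over expression-code tokens to produce its location-specific context. This is an illustrative stdlib-only sketch under assumed shapes, not the papers' implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(query, keys, values):
    """One Gaussian's positional query attends over expression-code
    tokens: scaled dot-product scores, softmax weights, then a weighted
    sum of value vectors as the location-specific context vector."""
    d = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / d for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]
```

In the full pipeline, the returned context would be concatenated with positional features and decoded into per-Gaussian offsets to position, orientation, and scale.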

These mechanisms establish a hierarchy of conditioning—from global (subject-wide) codes to per-primitive adaptation—eliminating the artifacts and blurring inherent in global-only approaches.

3. Dynamic Texture and Geometry Encoding

CAG-Avatar architectures encode both neutral (static) and dynamic (expression-varying) appearance and geometry:

  • Static Neutral Appearance: View-dependent static features are stored in tri-planes or UV-aligned lattices and mapped to each Gaussian or surface point. These are decoded into color using small MLPs (Wang et al., 21 Apr 2025, Peng et al., 7 Jun 2025).
  • Dynamic, Expression-Driven Detail: Lightweight 1D feature lines (Wang et al., 21 Apr 2025), patch-local encodings (Aneja et al., 14 Jul 2025), or per-Gaussian offsets predicted via cross-attention (Chang et al., 21 Jan 2026)/FLAME-conditioned networks (Fazylov et al., 6 Dec 2025) supply dynamic adaptation. Expression code mixing is handled either by attention, PCA, or direct weighting.
  • Separation of Identity and Expression: Architectures such as PGHM (Peng et al., 7 Jun 2025) and AGORA (Fazylov et al., 6 Dec 2025) explicitly factor identity (via latent code or canonical mesh anchors) and expression (via conditioning branches or residual generators) to disentangle appearance and behavior, facilitating fast subject-specific adaptation.
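The 1D feature-line mechanism above amounts to interpolating compact per-axis feature arrays and blending them across expressions. The sketch below is a toy stand-in under assumed data layouts (scalar features, direct weighting), not the cited method's code.

```python
def sample_line(line, x):
    """Linearly interpolate a 1D feature line at normalized x in [0, 1]."""
    t = x * (len(line) - 1)
    i = min(int(t), len(line) - 2)
    f = t - i
    return (1.0 - f) * line[i] + f * line[i + 1]

def mix_lines(lines, weights):
    """Expression mixing: blend per-expression feature lines into a
    single line via direct weighting (attention- or PCA-based blending
    would replace the weights here)."""
    n = len(lines[0])
    return [sum(w * line[j] for w, line in zip(weights, lines))
            for j in range(n)]
```

Sampling one such line per canonical axis and combining the results with static tri-plane features would then feed the small decoder MLPs described above.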

This explicit factorization supports robust animation under varying facial and pose conditions without sacrificing avatar-specific details.

4. Training, Regularization, and Sampling

CAG-Avatar training integrates specialized regularization and data balancing strategies:

  • Adaptive Opacity Penalty: To prevent opacity leakage in static or near-static mesh regions, penalties are imposed on Gaussians that exhibit excessive opacity except where substantial mesh movement is observed, with thresholds dynamically set per batch or percentile (Wang et al., 21 Apr 2025).
  • Class-Balanced Expression Sampling: Expression or pose classes (obtained via clustering on deformation statistics) are balanced in the training mini-batch to address data imbalance that otherwise leads to poor coverage of rare or extreme expressions (Wang et al., 21 Apr 2025). Sampling probability per class $P(e) \propto 1/N_e$ enforces uniform class representation.
  • Hierarchical LoD and ROI-Selective Optimization: LoDAvatar and similar methods employ multi-level Gaussian hierarchies, enabling runtime switching or pruning of Gaussians based on viewpoint distance, screen-space error, or budgets; this maintains performance and visual quality, especially in multi-subject scenarios (Dongye et al., 2024).
  • Regularization of Geometry and Scale: L2/L1 penalties on offsets, scales, and opacities, as well as anchor denoising and scale clamping, are applied across architectures, ensuring stability and preventing degenerate solution drift during extended training (Wang et al., 21 Apr 2025, Aneja et al., 14 Jul 2025, Peng et al., 7 Jun 2025).
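The class-balanced sampling rule $P(e) \propto 1/N_e$ can be realized by assigning each training example a weight inversely proportional to its class count. A minimal sketch (the function name is illustrative):

```python
from collections import Counter

def balanced_probs(labels):
    """Per-example sampling probabilities enforcing P(e) ∝ 1/N_e for
    each class e, so every expression class is equally likely per draw."""
    counts = Counter(labels)
    w = [1.0 / counts[l] for l in labels]
    total = sum(w)  # equals the number of distinct classes
    return [x / total for x in w]
```

Feeding these probabilities to a weighted sampler yields mini-batches with uniform coverage across common and rare expression clusters.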

Losses typically combine pixelwise, perceptual (LPIPS), and structure-aware (SSIM) components, with auxiliary weighting for particularly challenging regions (e.g., mouth) (Chang et al., 21 Jan 2026).
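The auxiliary region weighting can be illustrated with a weighted pixelwise L1 term; the perceptual (LPIPS) and structural (SSIM) components of the full objective would be added analogously. This is a toy sketch over flat pixel lists, not the papers' loss code.

```python
def weighted_l1(pred, target, weight):
    """Pixelwise L1 with auxiliary per-pixel weights, e.g. upweighting
    pixels inside a mouth mask relative to the rest of the face."""
    num = sum(w * abs(p - t) for p, t, w in zip(pred, target, weight))
    return num / sum(weight)
```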

5. Rendering Efficiency and Memory Footprint

CAG-Avatar frameworks explicitly prioritize real-time rendering and storage efficiency:

| Method | Storage / subject | Speed (FPS, GPU) | Novel-View PSNR | Self-Reenact PSNR | Notable features |
|---|---|---|---|---|---|
| CAG-Avatar (Wang et al., 21 Apr 2025) | 10 MB | 300 (RTX 4090) | 32.97 | 28.07 | Tri-planes, feature lines |
| GA (Wang et al., 21 Apr 2025) | 21 MB | - | - | - | Baseline Gaussian Avatars |
| GHA (Wang et al., 21 Apr 2025) | 120 MB | - | - | - | Gaussian Head Avatars |
| GBS (Wang et al., 21 Apr 2025) | 2 GB | - | - | - | Gaussian Body Splatting |
| PGHM (Peng et al., 7 Jun 2025) | - | - | 31.85 | - | Per-subject 20 min adaptation |
| AGORA (Fazylov et al., 6 Dec 2025) | - | 250 (A6000) | - | - | 9 FPS CPU, explicit fast splats |

Efficient real-time rendering is achieved through:

  • Explicit rasterization (avoiding ray marching).
  • Adaptive per-primitive sparseness: Storing dynamic features in compact tensorial forms (tri-planes, feature lines, anchor sets).
  • Selective inference paths (caching identity branches, evaluating only lightweight adaptors per frame).

This design supports deployment in interactive and multi-user environments, enabling live animation and view synthesis.
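Runtime level-of-detail switching, as used in hierarchical schemes like LoDAvatar, reduces the active Gaussian count for distant or low-priority subjects. A minimal sketch of distance-banded level selection (the thresholding policy is an illustrative assumption):

```python
def select_lod(distance, thresholds):
    """Choose a level of detail from viewpoint distance: level 0
    (finest, most Gaussians) for the nearest band, progressively
    coarser levels for bands further out."""
    for level, t in enumerate(thresholds):
        if distance <= t:
            return level
    return len(thresholds)
```

Real systems combine such distance bands with screen-space error estimates and per-frame Gaussian budgets, but the switching logic itself stays this cheap.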

6. Empirical Results and Comparative Evaluation

CAG-Avatar methods consistently outperform global-conditioning or NeRF-based animation in image fidelity, detail recovery, and animation smoothness:

  • NeRSemble dataset (CAG-Avatar (Wang et al., 21 Apr 2025)): PSNR≈32.97, SSIM≈0.9506, LPIPS≈0.0594 for novel-view synthesis; self-reenactment PSNR≈28.07. Storage requirement is roughly half that of Gaussian Avatars (GA).
  • ScaffoldAvatar (Aneja et al., 14 Jul 2025): For close-up head synthesis at 3K resolution, achieved PSNR≈34.5, SSIM≈0.971, LPIPS≈0.126. Outperforms state-of-the-art (GaussianAvatars, GHA, NPGA) on facial region metrics and microfeature fidelity.
  • LoDAvatar (Dongye et al., 2024): LoD3 achieves PSNR≈30.3 dB, SSIM≈0.98, LPIPS≈0.04; controlled LoD switching supports significant performance gains with tunable visual quality.
  • AGORA (Fazylov et al., 6 Dec 2025): FID=3.17 (competitive with Next3D and EG3D), mean AED=0.682 (vs. 0.930), APD=0.025, running at 250 FPS on GPU and 9 FPS on CPU. Explicit deformation modeling yields crisper facial features and expression accuracy than NeRF-based methods.

The adaptive, per-region conditioning consistently preserves high-frequency texture and localized articulatory detail, eliminating the blurring and distortion artifacts associated with uniform global modulation.

7. Significance and Extensions

CAG-Avatar architectures represent a fundamental advance in 3DGS-driven digital human modeling, providing:

  • Granular control: By decoupling global driving signals and enabling per-region adaptation, CAG-Avatar models excel at capturing non-rigid deformations and microfeature dynamics across a variety of facial expressions and poses.
  • Efficiency and scalability: Compact representations (tri-planes; lightweight feature lines; patch-level MLPs) reduce footprint and offer real-time performance suitable for live animation scenarios, including multi-user and interactive systems.
  • Generalizability: Architectures such as PGHM allow rapid adaptation from monocular video to new subjects in minutes, bypassing the multi-view, hours-to-days optimization required by prior 3DGS approaches.
  • Compositionality and extendibility: Multi-level and region-of-interest detail management (as in LoDAvatar) preserves both rendering efficiency and visual quality, even for large scenes or multiple avatars.

CAG-Avatar approaches have rapidly become the foundation for high-fidelity, expressive, and computationally tractable digital human avatars across graphics, VR, telepresence, and entertainment applications (Wang et al., 21 Apr 2025, Chang et al., 21 Jan 2026, Dongye et al., 2024, Aneja et al., 14 Jul 2025, Fazylov et al., 6 Dec 2025, Peng et al., 7 Jun 2025).
