CloseUpAvatar: High-Fidelity 3D Avatars
- CloseUpAvatar is a photorealistic 3D avatar paradigm using surfel-based representations and adaptive multi-scale texture blending for detailed close-ups and full-body views.
- It integrates SMPL-X pose initialization with linear blend skinning, enabling real-time animation and rendering at high FPS while preserving fine image details.
- Comparative evaluations show it outperforms mesh-based and Gaussian-based methods, achieving superior image-quality metrics alongside efficient real-time performance.
CloseUpAvatar is a paradigm for photorealistic, animatable, high-fidelity 3D human avatars that emphasizes robust rendering quality for both full-body and head close-up views, even under challenging camera motions and extreme zoom. The approach is characterized by a hybrid avatar parametrization, an efficient multi-scale texture mechanism, and a specialized neural architecture that together address the competing demands of real-time performance, high-detail preservation, and animation-readiness in contemporary research and applications (Svitov et al., 3 Dec 2025).
1. Parametric Representation and Multi-Scale Texture Architecture
CloseUpAvatar encodes a human avatar as a set of textured planes ("surfels") rather than traditional meshes or dense volumetric grids. Each surfel carries the following attributes (a minimal data-structure sketch follows the list):
- Position
- 2D spatial scale
- Orientation as a unit quaternion
- Per-surfel 4×3 view-dependent SH coefficients for appearance
- Two aligned, learnable RGBA textures per surfel:
  - Coarse (low-frequency) texture
  - Fine (high-frequency) texture
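Assuming illustrative field names and array shapes (these are not the paper's API), such a per-surfel record might look like:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    """One textured plane ('surfel'); field names and shapes are illustrative."""
    position: np.ndarray     # (3,)  canonical 3D position
    scale: np.ndarray        # (2,)  2D spatial extent of the plane
    orientation: np.ndarray  # (4,)  unit quaternion
    sh_coeffs: np.ndarray    # (4, 3) view-dependent SH coefficients per RGB channel
    tex_coarse: np.ndarray   # (Hc, Wc, 4) low-frequency RGBA texture
    tex_fine: np.ndarray     # (Hf, Wf, 4) high-frequency RGBA texture
```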
During rendering, both the coarse and fine textures are sampled at the local surfel UV coordinate \((u, v)\), and the blended appearance takes the form

\[
T(u, v) = (1 - w)\, T_{\mathrm{coarse}}(u, v) + w\, T_{\mathrm{fine}}(u, v),
\]

where the mixture weight \(w \in [0, 1]\) for high-frequency detail is determined by the surfel's screen-space size.
This adaptive texture blending guarantees that high-frequency details—crucial for close-ups—are only synthesized when the camera is sufficiently close, reducing computation and avoiding aliasing or excessive blur at distance (Svitov et al., 3 Dec 2025).
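To make the blending concrete, the following minimal sketch applies the formula above per texture sample; the nearest-neighbour lookup, the linear weight ramp, and the threshold `tau` are our own assumptions, not the paper's exact schedule:

```python
import numpy as np

def sample_texture(tex: np.ndarray, u: float, v: float) -> np.ndarray:
    """Nearest-neighbour texture lookup (a real renderer would use bilinear/mip sampling)."""
    h, w = tex.shape[:2]
    x = min(int(u * (w - 1)), w - 1)
    y = min(int(v * (h - 1)), h - 1)
    return tex[y, x]

def blend_textures(tex_coarse, tex_fine, u, v, screen_size_px, tau=64.0):
    """Mix coarse and fine RGBA samples with a weight that grows with the surfel's
    projected screen-space size; the linear ramp and `tau` are illustrative."""
    w_fine = float(np.clip(screen_size_px / tau, 0.0, 1.0))
    c = sample_texture(tex_coarse, u, v)
    f = sample_texture(tex_fine, u, v)
    return (1.0 - w_fine) * c + w_fine * f
```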
2. Pose and Animation Rigging
All surfels are initialized from an SMPL-X mesh and oriented along local surface normals. To support articulate animation under arbitrary human motion, surfels are skinned to the body using standard Linear Blend Skinning (LBS). This permits real-time deformation in response to sequence or motion-capture (MoCap) driving, with surfel positions and orientations dynamically updated to reflect joint rotations and pose parameters (Svitov et al., 3 Dec 2025).
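A minimal NumPy sketch of standard LBS applied to canonical surfel positions is given below (function and argument names are ours); surfel orientation quaternions would analogously be rotated by the rotational part of the blended transform:

```python
import numpy as np

def lbs_transform(points, skin_weights, joint_transforms):
    """Standard linear blend skinning: each canonical point is deformed by a
    weighted sum of its bound joints' 4x4 transforms.

    points:           (N, 3) canonical surfel positions
    skin_weights:     (N, J) per-surfel weights over J joints (rows sum to 1)
    joint_transforms: (J, 4, 4) posed joint transforms
    """
    # Blend the joint transforms per point: (N, 4, 4)
    blended = np.einsum('nj,jab->nab', skin_weights, joint_transforms)
    # Apply each blended transform to the point in homogeneous coordinates
    homog = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)  # (N, 4)
    posed = np.einsum('nab,nb->na', blended, homog)
    return posed[:, :3]
```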
This paradigm preserves animatability and allows seamless integration with industry-standard motion-processing pipelines. Unlike earlier avatar systems that lacked skinnability or required very high primitive counts, CloseUpAvatar's surfel count remains manageable (≈20K), enabling real-time animation and rendering at frame rates far exceeding those of previous mesh- or Gaussian-based avatars.
3. Training Objectives, Losses, and Geometry Regularization
The training regime combines pixel-based, perceptual, and structural constraints:

\[
\mathcal{L} = \lambda_{\mathrm{rgb}}\,\mathcal{L}_{\mathrm{rgb}} + \lambda_{\mathrm{SSIM}}\,\mathcal{L}_{\mathrm{SSIM}} + \lambda_{\mathrm{LPIPS}}\,\mathcal{L}_{\mathrm{LPIPS}} + \lambda_{\mathrm{geom}}\,\mathcal{L}_{\mathrm{geom}},
\]

where:
- \(\mathcal{L}_{\mathrm{rgb}}\): L1 (early) and MSE (late) on rendered vs. ground-truth pixels
- \(\mathcal{L}_{\mathrm{SSIM}}\): multi-scale Structural Similarity Index (SSIM) loss
- \(\mathcal{L}_{\mathrm{LPIPS}}\): Learned Perceptual Image Patch Similarity loss
- \(\mathcal{L}_{\mathrm{geom}}\): geometry priors (Laplacian smoothing, scale regularization, normal/depth consistency).
Regularizing neighboring surfel offsets and enforcing physically plausible surfel scales preclude geometric artifacts and irregular deformation. The loss weights are chosen to ensure stable convergence and to balance photometric accuracy against surface regularity (Svitov et al., 3 Dec 2025).
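As an illustration, the combined objective can be sketched as a weighted sum; the weights below and the externally supplied SSIM/LPIPS callables (e.g., from the pytorch-msssim and lpips packages) are placeholders, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def total_loss(render, gt, geometry_reg, ssim_fn, lpips_fn,
               w_rgb=1.0, w_ssim=0.2, w_lpips=0.1, w_geom=0.01, use_l1=True):
    """Weighted sum of photometric, structural, perceptual, and geometric terms.

    ssim_fn / lpips_fn are user-supplied metric callables; geometry_reg is a
    precomputed scalar tensor holding the geometry priors. Weights are illustrative.
    """
    l_rgb = F.l1_loss(render, gt) if use_l1 else F.mse_loss(render, gt)
    l_ssim = 1.0 - ssim_fn(render, gt)    # SSIM measures similarity; use 1 - SSIM as a loss
    l_lpips = lpips_fn(render, gt).mean()
    return w_rgb * l_rgb + w_ssim * l_ssim + w_lpips * l_lpips + w_geom * geometry_reg
```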
4. Rendering Pipeline and Real-Time Performance
Each frame involves:
- Skinned transformation of canonical surfel positions/orientations via LBS.
- Intersecting rays with surfels using efficient CUDA-based ray-splat rasterization.
- Sampling and alpha-blending of low- and high-frequency textures per surfel hit.
- Depth-based compositing in front-to-back order.
Only ≈20K surfels are needed for comprehensive full-body and close-up coverage, achieving ≈244 FPS for close-ups and ≈350 FPS when zoomed out on RTX 4090-class GPUs, outperforming Gaussian-based methods in both throughput and fine-detail preservation (Svitov et al., 3 Dec 2025).
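The depth-based compositing step above can be illustrated per pixel as follows (a minimal sketch of standard front-to-back alpha compositing; the paper's CUDA rasterizer performs the equivalent operation per ray over sorted ray-surfel hits):

```python
import numpy as np

def composite_front_to_back(hits):
    """Front-to-back alpha compositing of per-pixel surfel hits.

    `hits` is a list of (depth, rgba) tuples for one pixel, with rgba values in [0, 1].
    """
    hits = sorted(hits, key=lambda h: h[0])   # nearest surfel first
    color = np.zeros(3)
    transmittance = 1.0                       # remaining visibility along the ray
    for _, rgba in hits:
        alpha = rgba[3]
        color += transmittance * alpha * rgba[:3]
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-3:              # early termination once opaque
            break
    return color
```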
5. Comparative Evaluation and Quantitative Results
CloseUpAvatar demonstrates superior or competitive image quality (quantified by PSNR, SSIM, LPIPS, FID) and higher inference speed relative to mesh-based, animatable Gaussian, and prior hybrid avatars. On ActorsHQ, CloseUpAvatar achieves, for zoom-out conditions, PSNR = 36.43, SSIM = 0.990, LPIPS = 0.055, FID = 22.1, and 350 FPS; and for zoom-in, PSNR = 27.69, SSIM = 0.735, LPIPS = 0.223, FID = 38.5, and 244 FPS. Blurring and artifacting that degrade high-frequency facial/body features in Gaussian-only representations (particularly under close-up) are mitigated by the adaptive multi-scale surfel texturing strategy (Svitov et al., 3 Dec 2025).
| Method | PSNR (in) | SSIM (in) | LPIPS (in) | FID (in) | FPS (in) | PSNR (out) | SSIM (out) | LPIPS (out) | FID (out) | FPS (out) |
|---|---|---|---|---|---|---|---|---|---|---|
| MeshAvatar | 24.16 | 0.716 | 0.319 | 66.7 | 11 | 33.10 | 0.982 | 0.087 | 36.1 | 27 |
| AnimatableGaussians | 28.53 | 0.737 | 0.311 | 49.5 | 16 | 33.90 | 0.987 | 0.058 | 23.7 | 15 |
| Mmlphuman | 27.64 | 0.730 | 0.300 | 57.2 | 279 | 33.62 | 0.986 | 0.068 | 31.2 | 232 |
| CloseUpAvatar (Ours) | 27.69 | 0.735 | 0.223 | 38.5 | 244 | 36.43 | 0.990 | 0.055 | 22.1 | 350 |
In = zoom-in (close-up), Out = zoom-out; higher is better for PSNR, SSIM, and FPS, lower is better for LPIPS and FID. (Svitov et al., 3 Dec 2025)
6. Limitations and Prospective Extensions
Current CloseUpAvatar implementations encounter challenges when representing very small-scale, non-rigid geometric detail (e.g., fingertips, facial micro-expressions) because surfels remain relatively coarse in these anatomical regions. High-frequency geometric displacements (wrinkles, nails) are not explicitly modeled and are instead delegated to the texture channels. Notable avenues for advancement include:
- Introducing hybrid primitives—adaptive surfel sizing at high-curvature or detail-critical regions (hands, face)
- Integrating learned micro-displacement (fine geometry) maps atop the surfel framework
- Joint lighting/relighting optimization for dynamic illumination scenarios and physically based rendering
- Explicit modeling of additional modalities (specular, roughness) for relightable avatars
Further research directions envisage extending these representations with fine-grained adaptive priors and multi-modal texture/geometry learning (Svitov et al., 3 Dec 2025).
7. Relationship to Prior Art and Position in the Avatar Landscape
CloseUpAvatar diverges from mesh-only (Zhu et al., 2021), Gaussian-only (Aneja et al., 14 Jul 2025, Li et al., 24 Nov 2025), implicit neural field (Zeng et al., 2023), and triplane-based (Liu et al., 25 Mar 2025) avatar representations by leveraging surfel "billboards" with multi-scale, learnable textures that are selectively blended. This formulation achieves a crucial balance between real-time execution, animation-readiness, and preservation of photorealistic close-up details for applications requiring extreme camera proximity.
Other systems—such as ScaffoldAvatar (Aneja et al., 14 Jul 2025), AvatarBrush (Li et al., 24 Nov 2025), and FaceCraft4D (Yin et al., 21 Apr 2025)—provide orthogonal advances in head-level fidelity, local editability, and 4D dynamics, but do not provide the same full-body, camera-distance-adaptive rendering pipeline coupled to articulated animation at comparable scalability and frame rates.
The CloseUpAvatar architectural paradigm is positioned as a foundational building block for next-generation, highly realistic, and performance-optimized digital human representations (Svitov et al., 3 Dec 2025).