
ParDy-Human: Dynamic 3D Avatar Synthesis

Updated 7 March 2026
  • The paper introduces an explicit 3D Gaussian splatting framework that enables high-fidelity, controllable human avatar synthesis from minimal input data.
  • It employs a two-stage deformation pipeline combining SMPL-based rigid transformation with a Deformation Refinement Module to capture fine-scale dynamics like garment motion.
  • Quantitative evaluations demonstrate superior performance, achieving real-time rendering and reduced supervision compared to implicit neural radiance field methods.

ParDy-Human (Parameterized Dynamic Human Avatar) is an explicit, deformable 3D Gaussian splatting framework for constructing animatable digital human avatars from as little as a single monocular video sequence or a sparse multi-view camera setup. Unlike implicit neural radiance field (NeRF)-based approaches, which require dense multi-view data and auxiliary per-frame annotations such as masks or UV maps, ParDy-Human enables high-fidelity, high-resolution avatar synthesis with significantly reduced supervisory requirements. The method embeds parameter-driven animation dynamics into the explicit 3D Gaussian splatting paradigm, achieving real-time rendering and explicit pose control via SMPL body model parameterization (Jung et al., 2023).

1. Motivation and Conceptual Framework

Prior methods for dynamic human avatar synthesis, such as neural radiance fields and implicit volumetric representations, provide photorealistic novel-view and novel-pose rendering but are hindered by the need for dense synchronized captures, per-frame segmentation, or explicit geometry. Kerbl et al.'s 3D Gaussian Splatting (3D-GS) augmented explicit point-based rendering with real-time capabilities, but was largely constrained to static scenes.

ParDy-Human addresses these limitations by explicitly encoding both humans and background as unstructured clouds of anisotropic 3D Gaussians, where each Gaussian’s motion is parameterized by SMPL body pose and complemented by a learned residual deformation for non-rigid details such as clothing dynamics. The approach offers two principal advantages: (1) efficiency, enabling training and inference with fewer input images and consumer hardware, and (2) explicit controllability of pose and shape through SMPL parameters.

2. Two-Stage Deformation Pipeline

Avatar animation in ParDy-Human proceeds through a two-stage deformation process:

  1. SMPL-Based Rigid Deformation: Each Gaussian $G_i$ is initially placed on a canonical SMPL T-pose mesh. For each SMPL face $i$, the Gaussian’s canonical mean $\mu_i$ and covariance $\Sigma_i$ are transformed by the per-face rigid transformation $(R^i_t, t^i_t)$ derived from the mapping between canonical and posed SMPL surfaces. The rigidly deformed Gaussian has:

$$\mu_i^d = R^i_t\,\mu_i + t^i_t,\quad \Sigma_i^d = R^i_t\,\Sigma_i\,(R^i_t)^T$$

  2. Deformation Refinement Module (DRM): SMPL-based skinning is insufficient for garment and hair deformations. For each posed Gaussian center $\mu^d_i$, an encoding $E^i_d = \mu^d_i - J_t$ (where $J_t$ is the nearest SMPL joint) is input into a 13-layer fully connected MLP with ReLU activations and highway skip connections. This predicts a 7-dimensional residual:

    • Translation $\Delta t^i \in \mathbb{R}^3$
    • Axis-angle rotation $(u^i,\ \alpha^i)$

The final Gaussian parameters are:

$$\mu^r_i = R^i_r\,\mu^d_i + t^i_r,\quad \Sigma^r_i = R^i_r\,\Sigma^d_i\,(R^i_r)^T$$

where $(R^i_r,\, t^i_r)$ is the rigid transform computed from the DRM output (the predicted translation and axis-angle rotation). This two-tier architecture permits factorized modeling of articulated human shape and secondary, fine-scale dynamics.
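The two stages can be illustrated with a minimal NumPy sketch for a single Gaussian. It is not the authors' implementation: the DRM network itself is not modeled, the 7-dimensional residual is simply passed in, and its layout (three translation components, then three axis components, then the angle) is an assumption made for illustration. The function names are hypothetical.

```python
import numpy as np

def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` radians about `axis`."""
    axis = np.asarray(axis, dtype=float)
    norm = np.linalg.norm(axis)
    if norm < 1e-12:
        # A degenerate axis is treated as "no rotation".
        return np.eye(3)
    u = axis / norm
    K = np.array([[0.0, -u[2], u[1]],
                  [u[2], 0.0, -u[0]],
                  [-u[1], u[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def apply_rigid(R, t, mu, Sigma):
    """Apply x -> R x + t to a Gaussian's mean and covariance."""
    return R @ mu + t, R @ Sigma @ R.T

def deform_gaussian(mu, Sigma, R_t, t_t, residual):
    """Stage 1: per-face SMPL transform. Stage 2: residual refinement."""
    mu_d, Sigma_d = apply_rigid(R_t, t_t, mu, Sigma)
    # Assumed residual layout: [translation (3), axis (3), angle (1)].
    delta_t, axis, angle = residual[:3], residual[3:6], residual[6]
    R_r = axis_angle_to_matrix(axis, angle)
    return apply_rigid(R_r, delta_t, mu_d, Sigma_d)

# Example: a 90-degree SMPL rotation about z, then a small residual shift.
mu = np.array([1.0, 0.0, 0.0])
Sigma = np.diag([0.04, 0.01, 0.01])
R_t = axis_angle_to_matrix([0.0, 0.0, 1.0], np.pi / 2)
t_t = np.array([0.0, 0.0, 0.5])
residual = np.array([0.0, 0.1, 0.0, 0.0, 0.0, 1.0, 0.0])
mu_r, Sigma_r = deform_gaussian(mu, Sigma, R_t, t_t, residual)
```

Because both stages are rigid, the covariance is only rotated, never rescaled, so each Gaussian keeps its canonical extent while following the body and the residual motion.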

3. Rendering Process

Rendering is performed by projecting each refined 3D Gaussian as a 2D ellipse (splat) onto the image plane. Each Gaussian retains spherical harmonics coefficients for view-dependent appearance and is rendered as follows:

  • Project $\mu^r_i$ to pixel coordinates $p_i$ with an ellipse characterized by $\Sigma^r_i$.
  • Compute view directions:
    • Camera direction: $d^i_{rel} = (c - \mu^r_i)/\|c - \mu^r_i\|$ ($c$: camera center)
    • Normal-corrected: $d^i_n = R^i_r\,R^i_t\,n_i$ ($n_i$: SMPL face normal)
  • Splat ellipses in painter’s algorithm order (back-to-front) using alpha blending, accumulating both opacity and color contributions.

This pipeline produces full-resolution ($1280\times1024$) frames at $0.25$–$0.3$ seconds per frame on RTX 3080-MaxQ hardware.
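The camera-direction formula and the back-to-front blending step can be sketched for a single pixel as follows. This is a CPU illustration only, not the GPU rasterizer used in practice. The per-splat `alphas` are assumed to already combine each Gaussian's opacity with its 2D falloff at the pixel, and the function names are hypothetical.

```python
import numpy as np

def view_direction(camera_center, mu_r):
    """Unit vector d_rel from the refined Gaussian center toward the camera."""
    d = np.asarray(camera_center, dtype=float) - np.asarray(mu_r, dtype=float)
    return d / np.linalg.norm(d)

def composite_back_to_front(depths, colors, alphas, background=(0.0, 0.0, 0.0)):
    """Blend the splats covering one pixel, farthest first ("over" operator)."""
    pixel = np.asarray(background, dtype=float)
    order = np.argsort(depths)[::-1]  # largest depth (farthest) first
    for i in order:
        color = np.asarray(colors[i], dtype=float)
        pixel = alphas[i] * color + (1.0 - alphas[i]) * pixel
    return pixel

# Example: an opaque red splat behind a half-transparent blue one.
depths = [2.0, 1.0]
colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
alphas = [1.0, 0.5]
pixel = composite_back_to_front(depths, colors, alphas)  # -> [0.5, 0.0, 0.5]
```

Sorting by depth before blending is what makes the result order-dependent: swapping the two depths in the example would hide the blue splat entirely behind the opaque red one.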

4. Supervision, Losses, and Optimization

ParDy-Human is trained end-to-end without any mask, depth, or UV supervision. Training proceeds by matching the RGB renderings $\hat{y}_j$ with input images $I_j$ via a weighted compound loss:

$$L = \lambda_{L1} \|\hat{y} - I\|_1 + \lambda_{SSIM} (1 - \mathrm{SSIM}(\hat{y}, I)) + \lambda_{LPIPS}\, \mathrm{LPIPS}(\hat{y}, I)$$

Typical weights are $\lambda_{L1} = 0.6$, $\lambda_{SSIM} = 0.4$, $\lambda_{LPIPS} = 0.4$.
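As a rough illustration of how the three terms combine, the sketch below computes the weighted sum with the weights quoted above. The SSIM and LPIPS terms are supplied by the caller as functions (`ssim_fn` and `lpips_fn` are hypothetical parameters, not a specific library API), and the L1 term is taken as a mean absolute error, which is an assumption about the normalization.

```python
import numpy as np

def compound_loss(rendered, target, ssim_fn, lpips_fn,
                  w_l1=0.6, w_ssim=0.4, w_lpips=0.4):
    """Weighted sum of L1, (1 - SSIM), and LPIPS between two images."""
    rendered = np.asarray(rendered, dtype=float)
    target = np.asarray(target, dtype=float)
    l1 = np.mean(np.abs(rendered - target))  # assumed: mean, not sum
    ssim_term = 1.0 - ssim_fn(rendered, target)
    lpips_term = lpips_fn(rendered, target)
    return w_l1 * l1 + w_ssim * ssim_term + w_lpips * lpips_term

# Example with stand-in metric functions that return fixed values.
rendered = np.zeros((4, 4, 3))
target = np.full((4, 4, 3), 0.5)
loss = compound_loss(rendered, target,
                     ssim_fn=lambda a, b: 0.5,
                     lpips_fn=lambda a, b: 0.25)
# 0.6 * 0.5 + 0.4 * (1 - 0.5) + 0.4 * 0.25 = 0.6
```

In a real training loop the two metric functions would be differentiable implementations operating on tensors, so that gradients flow back to the Gaussian and DRM parameters.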

Parameter optimization is staged:

  • Iterations 0–10k: simultaneous update of all parameters.
  • Iterations 10k–100k: alternating blocks freezing either Gaussians or DRM, combined with an adaptive split/merge scheme for Gaussians (as in Kerbl et al.).
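The staged schedule can be expressed as a small helper that reports which parameter groups are trainable at a given iteration. The source does not state how long each alternating block lasts or which group is updated first, so `block_len` and the Gaussians-first ordering are purely illustrative assumptions. The adaptive densification of Gaussians is not modeled here.

```python
def trainable_groups(iteration, warmup=10_000, total=100_000, block_len=1_000):
    """Return which parameter groups are updated at `iteration`."""
    if not 0 <= iteration < total:
        raise ValueError("iteration outside the training schedule")
    if iteration < warmup:
        # Joint phase: Gaussians and DRM are optimized together.
        return {"gaussians": True, "drm": True}
    # Alternating phase: even blocks train Gaussians, odd blocks train the DRM.
    block = (iteration - warmup) // block_len
    train_gaussians = block % 2 == 0
    return {"gaussians": train_gaussians, "drm": not train_gaussians}

# Example: inspect the schedule at a few iterations.
joint = trainable_groups(5_000)
first_block = trainable_groups(10_000)
second_block = trainable_groups(11_000)
```

A training loop would consult this helper each step and freeze the group marked `False`, for example by skipping its optimizer step.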

5. Quantitative and Qualitative Evaluation

Experiments on ZJU-MoCap (four subjects) and THUman4.0 (three subjects), using seven-camera subsets and 1-in-50 frame sampling, compare ParDy-Human against two state-of-the-art implicit methods, UV-Volumes [Chen et al.] and PoseVocab [Li et al.]. ParDy-Human attains the best SSIM and LPIPS on both datasets and the best PSNR on THUman4.0, while UV-Volumes reports a higher PSNR on ZJU-MoCap.

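Values are listed as UV-Volumes (UVV) / PoseVocab (PV) / ParDy-Human (Ours).

| Dataset   | PSNR (UVV / PV / Ours) | SSIM (UVV / PV / Ours) | LPIPS (UVV / PV / Ours) |
|-----------|------------------------|------------------------|-------------------------|
| ZJU-MoCap | 28.5 / 22.8 / 27.6     | 0.960 / 0.962 / 0.963  | 0.051 / 0.045 / 0.038   |
| THUman4.0 | 26.9 / 25.8 / 28.5     | 0.949 / 0.964 / 0.966  | 0.064 / 0.054 / 0.044   |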

Qualitative findings indicate sharper garment details, fewer artifacts under pose variation, and roughly $50\times$ faster inference than PoseVocab (0.3 s vs. 15 s). UV-Volumes handles illumination but suffers from geometric artifacts, while PoseVocab generalizes poorly in terms of texture fidelity.

6. Ablation Studies and Impact of Key Modules

Ablation confirms the necessity of the DRM: omitting it ("wo.D") leads to blurred or ghosted textures and a $\sim10\%$ increase in LPIPS, with visible motion artifact boundaries. Use of ground-truth masks ("w.M") degrades boundary geometry, producing "halo" artifacts, whereas mask-free training yields cleaner silhouettes. Single-view training enables plausible 3D avatars for seen poses, although the system cannot hallucinate appearance for occluded regions.

7. Advantages, Limitations, and Prospects

ParDy-Human’s explicit, point-based architecture enables real-time, high-resolution inference on consumer hardware and efficient learning from as few as seven views, or even a single view, without mask or depth requirements. Pose control is direct via SMPL parameters. Identified limitations include the merging of Gaussians when representing large, uniform garments, which results in spikes, and the inability to synthesize unobserved (self-occluded) surfaces. Prospective enhancements include learned texture modules for improved shading, fusion of sparse structure-from-motion backgrounds, and integration with generative priors (e.g., diffusion models or variational autoencoders) to hallucinate occluded or seldom-observed content (Jung et al., 2023).
