InpaintHuman: Animatable 3D Avatars from Video
- InpaintHuman constructs animatable 3D human avatars from heavily occluded monocular video while preserving identity and realistic geometry.
- It employs a multi-scale UV-parameterized 3D Gaussian framework coupled with personalized diffusion-based inpainting for accurate reconstructions.
- Performance exceeds state-of-the-art approaches on both synthetic and real-world benchmarks, improving PSNR and visual fidelity.
InpaintHuman is a two-stage, analysis-by-synthesis framework for reconstructing complete, animatable 3D human avatars from heavily occluded monocular video using a combination of coarse-to-fine UV-parameterized 3D Gaussian splatting and personalized, semantic-guided diffusion-based inpainting. It addresses challenges in faithful geometry recovery and identity preservation under conditions of severe and persistent occlusion, surpassing previous state-of-the-art interpolation- and supervision-guided approaches in both synthetic and real-world benchmarks (Fan et al., 5 Jan 2026).
1. Multi-Scale UV-Parameterized Canonical Representation
InpaintHuman adopts a canonical human body parameterization based on the SMPL mesh model. Every 3D surface point is associated with UV coordinates via the SMPL template, effectively mapping the human surface to a 2D manifold. Over this manifold, a set of multi-scale feature maps at increasing resolutions (e.g., 64², 128², 256²) is defined.
During canonical representation construction:
- Each Gaussian sample at UV coordinate $u$ receives a bilinearly interpolated feature $f_l(u)$ from each scale $l$.
- Canonical features are aggregated either by summation, $f(u) = \sum_l f_l(u)$, or by a learned blending weight:

$$f(u) = w(u)\, f_{\mathrm{coarse}}(u) + \big(1 - w(u)\big)\, f_{\mathrm{fine}}(u),$$

where $w(u) \in [0, 1]$ is a learned per-texel function and $f_{\mathrm{coarse}}$, $f_{\mathrm{fine}}$ are the coarse- and fine-level features.
Optionally, a pose-dependent feature produced by a ControlNet-style encoder of the posed SMPL mesh is added. The combined feature is decoded by an MLP into the canonical Gaussian attributes: position offset, scale, and color. This multi-scale, hierarchical interpolation lets coarse features propagate context across large occlusions while fine scales preserve detailed geometry.
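As a concrete illustration, the following PyTorch sketch (an independent reimplementation, not the authors' code; the resolutions, channel width, blending network, and decoder size are assumptions) performs the multi-scale UV lookup, blends coarse and fine features with a learned weight, and decodes per-Gaussian attributes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleUVFeatures(nn.Module):
    def __init__(self, resolutions=(64, 128, 256), channels=32):
        super().__init__()
        # One learnable feature map per UV scale (coarse -> fine).
        self.maps = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, channels, r, r)) for r in resolutions]
        )
        # Learned per-texel blending weight between coarse and fine features.
        self.blend = nn.Sequential(nn.Linear(2 * channels, 1), nn.Sigmoid())
        # MLP decoder: blended feature -> Gaussian offset (3) + scale (3) + color (3).
        self.decode = nn.Sequential(nn.Linear(channels, 128), nn.ReLU(), nn.Linear(128, 9))

    def forward(self, uv):
        # uv: (N, 2) UV coordinates in [0, 1] on the SMPL template.
        grid = (uv * 2.0 - 1.0).view(1, -1, 1, 2)           # grid_sample expects [-1, 1]
        feats = []
        for fmap in self.maps:
            f = F.grid_sample(fmap, grid, mode="bilinear", align_corners=True)
            feats.append(f.view(fmap.shape[1], -1).t())     # -> (N, C)
        coarse, fine = feats[0], feats[-1]
        w = self.blend(torch.cat([coarse, fine], dim=-1))   # (N, 1), in [0, 1]
        fused = w * coarse + (1.0 - w) * fine               # blended canonical feature
        offset, scale, color = self.decode(fused).split(3, dim=-1)
        return offset, scale, color
```

For a batch of UV samples `uv` with shape `(N, 2)`, the module returns three `(N, 3)` tensors holding per-Gaussian offset, scale, and color.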
2. Personalized Diffusion-Based Inpainting Module
While UV interpolation propagates features across observed regions, it cannot hallucinate unseen areas (e.g., consistently occluded body parts). InpaintHuman introduces a diffusion-based inpainting module, personalized to the subject, to synthesize missing textures and geometry while tightly preserving subject identity and plausible semantics.
Key components:
- Textual Inversion: A unique token $V^*$ is inserted into the text-prompt vocabulary, with an embedding learned such that diffusion denoising, conditioned on $V^*$, reconstructs the visible subject images. The loss is $\mathcal{L}_{\mathrm{TI}} = \mathbb{E}_{z,\, \epsilon \sim \mathcal{N}(0,1),\, t}\big[\|\epsilon - \epsilon_\theta(z_t, t, c_\phi(V^*))\|_2^2\big]$, with $z_t$ the noised latent, $\epsilon_\theta$ the U-Net denoiser, and $c_\phi$ the text encoder.
- Semantic-Conditioned ControlNet Guidance: For each frame, a semantic part-label map is rendered from the SMPL mesh at the frame's estimated pose. A ControlNet branch injects this prior into the diffusion model to enforce spatial coherence with the underlying body layout.
- Joint Inpainting Loss: Self-supervised training randomly masks pixels in visible crops and requires the system to recover them, using a sum of standard diffusion, pixel-reconstruction, and optional identity-matching losses, $\mathcal{L}_{\mathrm{inpaint}} = \mathcal{L}_{\mathrm{diff}} + \lambda_{\mathrm{pix}}\, \mathcal{L}_{\mathrm{pix}} + \lambda_{\mathrm{id}}\, \mathcal{L}_{\mathrm{id}}$, with $\mathcal{L}_{\mathrm{diff}}$ the denoising-plus-semantic loss, $\mathcal{L}_{\mathrm{pix}}$ a pixel-level loss, and $\mathcal{L}_{\mathrm{id}}$ an optional feature-matching identity constraint (e.g., via a face-recognition network).
This approach curbs “identity drift” and ensures that the inpainted regions are temporally consistent and subject-specific.
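A minimal sketch of how the three loss terms above could be combined is given below; the helper callables (`denoiser`, `face_encoder`) and the loss weights are illustrative placeholders, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def inpaint_loss(denoiser, z_t, t, cond, noise,
                 pred_rgb, target_rgb, mask,
                 face_encoder=None, lambda_pix=1.0, lambda_id=0.1):
    # (1) Denoising + semantic term: epsilon-prediction MSE, where `cond` is assumed
    #     to bundle the personalized V* text embedding and the ControlNet semantic map.
    l_diff = F.mse_loss(denoiser(z_t, t, cond), noise)

    # (2) Pixel term over the randomly masked (but actually visible) crop pixels.
    l_pix = (mask * (pred_rgb - target_rgb).abs()).sum() / mask.sum().clamp(min=1)

    # (3) Optional identity term: feature matching through a face-recognition network.
    l_id = torch.zeros((), device=pred_rgb.device)
    if face_encoder is not None:
        l_id = 1.0 - F.cosine_similarity(
            face_encoder(pred_rgb), face_encoder(target_rgb), dim=-1
        ).mean()

    return l_diff + lambda_pix * l_pix + lambda_id * l_id
```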
3. End-to-End Algorithmic Pipeline
The InpaintHuman pipeline consists of three major stages:
    # Stage 1: coarse avatar initialization from visible pixels
    initialize UV feature maps F_l, decoder D
    repeat until convergence:
        for each frame i:
            render Gaussians with current F_l, D -> I_hat_i
            compute L_init = sum_{p in M_vis^i} |I_hat_i(p) - I_i(p)|
            update F_l, D via gradient of L_init

    # Stage 2: personalize the inpainting module on visible evidence
    collect visible crops I_i^{vis}, visibility masks M_vis^i
    learn text token V* via L_TI
    fine-tune inpainting LoRA parameters to minimize L_inpaint

    # Stage 3: inpainting-guided refinement of the avatar
    for each frame i:
        run personalized inpainting -> I_hat_i^{full}, full mask M_full^i
        render UV avatar -> I_hat_i^{ref}
        compute L_refine = sum_{p in M_full^i} |I_hat_i^{ref}(p) - I_hat_i^{full}(p)|
                           + lambda_ssim * SSIM + lambda_lpips * LPIPS
        update F_l, D via gradient of L_refine
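The stage-3 refinement objective mixes a masked L1 term with SSIM and LPIPS terms. A hedged sketch follows, assuming the third-party `lpips` and `pytorch_msssim` packages, illustrative weights, and SSIM used as a dissimilarity (1 - SSIM):

```python
import torch
import lpips                      # pip install lpips
from pytorch_msssim import ssim   # pip install pytorch-msssim

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual distance; expects inputs in [-1, 1]

def refine_loss(render, inpainted, mask, lambda_ssim=0.2, lambda_lpips=0.2):
    # render, inpainted: (1, 3, H, W) in [0, 1]; mask: (1, 1, H, W) full mask M_full.
    l1 = (mask * (render - inpainted).abs()).sum() / mask.sum().clamp(min=1)
    l_ssim = 1.0 - ssim(render * mask, inpainted * mask, data_range=1.0)
    l_lpips = lpips_fn(render * 2.0 - 1.0, inpainted * 2.0 - 1.0).mean()
    return l1 + lambda_ssim * l_ssim + lambda_lpips * l_lpips
```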
4. Experimental Evaluation and Benchmarks
InpaintHuman was evaluated on PeopleSnapshot (synthetic occlusions), ZJU-MoCap (central-block masking; 100 frames for training, 22 views for testing), and OcMotion (real-world, persistent occlusions).
| Method | ZJU-MoCap PSNR (dB) | OcMotion PSNR (dB) |
|---|---|---|
| HumanNeRF | 20.67 | 9.79 |
| OccNeRF | 22.40 | 15.71 |
| OccFusion (SDS) | 23.96 | 18.28 |
| InpaintHuman | 24.65 | 19.02 |

Higher PSNR is better.
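For reference, PSNR is computed as $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$; a minimal helper, assuming images scaled to $[0, 1]$:

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))
```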
Ablation studies on PeopleSnapshot demonstrated progressive improvements as components were added:
- Base (no multi-scale, no textual inversion, no semantic guidance): 20.05 dB PSNR
- + Multi-Scale (MS): 22.35 dB
- + Textual Inversion (TI): 24.27 dB
- + Semantic Guidance (SG): 24.31 dB
Qualitatively, the method yields smoother, more detailed reconstructions than OccNeRF (which suffers from “blotchy” hole artifacts) and OccFusion (which hallucinates colors and fails in clothing texture fidelity) (Fan et al., 5 Jan 2026).
5. Comparison to Related Inpainting Approaches
Earlier facial and human inpainting techniques have leveraged:
- Exemplar- and attribute-guided GANs for facial completion (e.g., EXE-GAN (Lu et al., 2022), Reference-Guided (Yoon et al., 2023)), which focus mainly on 2D completion or attribute transfer for faces rather than general 3D reconstruction or animation.
- Approaches such as EXE-GAN (Lu et al., 2022) employ mixed latent code style modulation, spatially-variant gradient weighting, and adversarial objectives to preserve both perceptual qualities and exemplar-driven attributes, but do not directly produce 3D avatars.
- Foreground-guided methods optimize fidelity via region-specific loss (e.g., (Jam et al., 2021)), but lack the latent-space synthesis and semantic regularization for unseen body parts required in general 3D occlusion scenarios.
Distinctively, InpaintHuman bridges learned 2D UV feature maps, geometric canonicalization, and personalized latent diffusion to enable reconstruction and animation of humans from incomplete monocular observations.
6. Contributions, Limitations, and Future Directions
InpaintHuman’s primary technical contributions are:
- Integration of multi-scale canonical UV encoding to balance occlusion robustness with fine geometry preservation.
- Personalized, identity-preserving latent diffusion inpainting with explicit ControlNet-style body semantics.
- An end-to-end staged optimization pipeline capable of reconstructing fully animatable, subject-faithful human avatars from occluded input video.
A limitation is reliance on accurate SMPL parameter estimates and subject-specific textual inversion, which may require adaptation for diverse ethnicities, clothing styles, or generalized motion beyond the original training domain. Expansion to more general subject types or in-the-wild scenarios with extreme, persistent occlusions may require extensions in semantic control and adaptive canonical parameterization.
7. Summary and Outlook
InpaintHuman combines coarse-to-fine UV-based geometry, subject-personalized diffusion inpainting, and semantic guidance to address occlusion-robust 3D human reconstruction from monocular video. It improves on prior approaches both quantitatively and qualitatively, providing a platform for future developments in high-fidelity, identity-preserving human digitization and animation from limited or imperfect real-world observations (Fan et al., 5 Jan 2026).