InpaintHuman: Animatable 3D Avatars from Video
- InpaintHuman constructs animatable 3D human avatars from heavily occluded monocular video while preserving identity and realistic geometry.
- It employs a multi-scale UV-parameterized 3D Gaussian framework coupled with personalized diffusion-based inpainting for accurate reconstructions.
- Performance exceeds state-of-the-art approaches on both synthetic and real-world benchmarks, improving PSNR and visual fidelity.
InpaintHuman is a two-stage, analysis-by-synthesis framework for reconstructing complete, animatable 3D human avatars from heavily occluded monocular video using a combination of coarse-to-fine UV-parameterized 3D Gaussian splatting and personalized, semantic-guided diffusion-based inpainting. It addresses challenges in faithful geometry recovery and identity preservation under conditions of severe and persistent occlusion, surpassing previous state-of-the-art interpolation- and supervision-guided approaches in both synthetic and real-world benchmarks (Fan et al., 5 Jan 2026).
1. Multi-Scale UV-Parameterized Canonical Representation
InpaintHuman adopts a canonical human body parameterization based on the SMPL mesh model. Every 3D surface point is associated with UV coordinates via the SMPL template, effectively mapping the human surface to a 2D manifold. Over this manifold, a set of multi-scale feature maps at increasing resolutions (e.g., 64², 128², 256²) is defined.
During canonical representation construction:
- Each Gaussian sample at UV coordinate $u$ receives a bilinearly interpolated feature $f_l(u)$ from each scale $l$.
- Canonical features are aggregated either by summation, $f(u) = \sum_l f_l(u)$, or by a learned blending weight:

$$f(u) = w(u)\, f_{\mathrm{coarse}}(u) + \big(1 - w(u)\big)\, f_{\mathrm{fine}}(u),$$

where $w(u) \in [0, 1]$ is a learned per-texel function and $f_{\mathrm{coarse}}$, $f_{\mathrm{fine}}$ are the coarse- and fine-level features.
Optionally, a pose-dependent feature produced by a ControlNet-style encoder of the posed SMPL mesh is added. The combined feature is decoded by an MLP into the canonical Gaussian attributes: position offset, scale, and color. This multi-scale, hierarchical interpolation lets coarse features propagate context across large occlusions while fine scales preserve detailed geometry.
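As a concrete illustration, the following PyTorch sketch (an independent reimplementation, not the authors' code; the resolutions, channel width, blending network, and decoder size are assumptions) performs the multi-scale UV lookup, blends coarse and fine features with a learned weight, and decodes per-Gaussian attributes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleUVFeatures(nn.Module):
    def __init__(self, resolutions=(64, 128, 256), channels=32):
        super().__init__()
        # One learnable feature map per UV scale (coarse -> fine).
        self.maps = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, channels, r, r)) for r in resolutions]
        )
        # Learned per-texel blending weight between coarse and fine features.
        self.blend = nn.Sequential(nn.Linear(2 * channels, 1), nn.Sigmoid())
        # MLP decoder: blended feature -> Gaussian offset (3) + scale (3) + color (3).
        self.decode = nn.Sequential(nn.Linear(channels, 128), nn.ReLU(), nn.Linear(128, 9))

    def forward(self, uv):
        # uv: (N, 2) UV coordinates in [0, 1] on the SMPL template.
        grid = (uv * 2.0 - 1.0).view(1, -1, 1, 2)           # grid_sample expects [-1, 1]
        feats = []
        for fmap in self.maps:
            f = F.grid_sample(fmap, grid, mode="bilinear", align_corners=True)
            feats.append(f.view(fmap.shape[1], -1).t())     # -> (N, C)
        coarse, fine = feats[0], feats[-1]
        w = self.blend(torch.cat([coarse, fine], dim=-1))   # (N, 1), in [0, 1]
        fused = w * coarse + (1.0 - w) * fine               # blended canonical feature
        offset, scale, color = self.decode(fused).split(3, dim=-1)
        return offset, scale, color
```

For a batch of UV samples `uv` with shape `(N, 2)`, the module returns three `(N, 3)` tensors holding per-Gaussian offset, scale, and color.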
2. Personalized Diffusion-Based Inpainting Module
While UV interpolation propagates features across observed regions, it cannot hallucinate unseen areas (e.g., consistently occluded body parts). InpaintHuman introduces a diffusion-based inpainting module, personalized to the subject, to synthesize missing textures and geometry while tightly preserving subject identity and plausible semantics.
Key components:
- Textual Inversion: A unique token $V^*$ is inserted into the text-prompt vocabulary, with an embedding learned such that diffusion denoising, conditioned on $V^*$, reconstructs the visible subject images. The loss is $\mathcal{L}_{\mathrm{TI}} = \mathbb{E}_{z,\, \epsilon \sim \mathcal{N}(0,1),\, t}\big[\|\epsilon - \epsilon_\theta(z_t, t, c_\phi(V^*))\|_2^2\big]$, with $z_t$ the noised latent, $\epsilon_\theta$ the U-Net denoiser, and $c_\phi$ the text encoder.
- Semantic-Conditioned ControlNet Guidance: For each frame, a semantic part-label map is rendered from the SMPL mesh at the frame's estimated pose. A ControlNet branch injects this prior into the diffusion model to enforce spatial coherence with the underlying body layout.
- Joint Inpainting Loss: Self-supervised training randomly masks pixels in visible crops and requires the system to recover them, using a sum of standard diffusion, pixel-reconstruction, and optional identity-matching losses, $\mathcal{L}_{\mathrm{inpaint}} = \mathcal{L}_{\mathrm{diff}} + \lambda_{\mathrm{pix}}\, \mathcal{L}_{\mathrm{pix}} + \lambda_{\mathrm{id}}\, \mathcal{L}_{\mathrm{id}}$, with $\mathcal{L}_{\mathrm{diff}}$ the denoising-plus-semantic loss, $\mathcal{L}_{\mathrm{pix}}$ a pixel-level loss, and $\mathcal{L}_{\mathrm{id}}$ an optional feature-matching identity constraint (e.g., via a face-recognition network).
This approach curbs “identity drift” and ensures that the inpainted regions are temporally consistent and subject-specific.
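A minimal sketch of how the three loss terms above could be combined is given below; the helper callables (`denoiser`, `face_encoder`) and the loss weights are illustrative placeholders, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def inpaint_loss(denoiser, z_t, t, cond, noise,
                 pred_rgb, target_rgb, mask,
                 face_encoder=None, lambda_pix=1.0, lambda_id=0.1):
    # (1) Denoising + semantic term: epsilon-prediction MSE, where `cond` is assumed
    #     to bundle the personalized V* text embedding and the ControlNet semantic map.
    l_diff = F.mse_loss(denoiser(z_t, t, cond), noise)

    # (2) Pixel term over the randomly masked (but actually visible) crop pixels.
    l_pix = (mask * (pred_rgb - target_rgb).abs()).sum() / mask.sum().clamp(min=1)

    # (3) Optional identity term: feature matching through a face-recognition network.
    l_id = torch.zeros((), device=pred_rgb.device)
    if face_encoder is not None:
        l_id = 1.0 - F.cosine_similarity(
            face_encoder(pred_rgb), face_encoder(target_rgb), dim=-1
        ).mean()

    return l_diff + lambda_pix * l_pix + lambda_id * l_id
```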
3. End-to-End Algorithmic Pipeline
The InpaintHuman pipeline consists of three major stages:
    # Stage 1: coarse avatar initialization from visible pixels
    initialize UV feature maps F_l, decoder D
    repeat until convergence:
        for each frame i:
            render Gaussians with current F_l, D -> I_hat_i
            compute L_init = sum_{p in M_vis^i} |I_hat_i(p) - I_i(p)|
            update F_l, D via gradient of L_init

    # Stage 2: personalize the inpainting module on visible evidence
    collect visible crops I_i^{vis}, visibility masks M_vis^i
    learn text token V* via L_TI
    fine-tune inpainting LoRA parameters to minimize L_inpaint

    # Stage 3: inpainting-guided refinement of the avatar
    for each frame i:
        run personalized inpainting -> I_hat_i^{full}, full mask M_full^i
        render UV avatar -> I_hat_i^{ref}
        compute L_refine = sum_{p in M_full^i} |I_hat_i^{ref}(p) - I_hat_i^{full}(p)|
                           + lambda_ssim * SSIM + lambda_lpips * LPIPS
        update F_l, D via gradient of L_refine
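The stage-3 refinement objective mixes a masked L1 term with SSIM and LPIPS terms. A hedged sketch follows, assuming the third-party `lpips` and `pytorch_msssim` packages, illustrative weights, and SSIM used as a dissimilarity (1 - SSIM):

```python
import torch
import lpips                      # pip install lpips
from pytorch_msssim import ssim   # pip install pytorch-msssim

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual distance; expects inputs in [-1, 1]

def refine_loss(render, inpainted, mask, lambda_ssim=0.2, lambda_lpips=0.2):
    # render, inpainted: (1, 3, H, W) in [0, 1]; mask: (1, 1, H, W) full mask M_full.
    l1 = (mask * (render - inpainted).abs()).sum() / mask.sum().clamp(min=1)
    l_ssim = 1.0 - ssim(render * mask, inpainted * mask, data_range=1.0)
    l_lpips = lpips_fn(render * 2.0 - 1.0, inpainted * 2.0 - 1.0).mean()
    return l1 + lambda_ssim * l_ssim + lambda_lpips * l_lpips
```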
4. Experimental Evaluation and Benchmarks
InpaintHuman was evaluated on PeopleSnapshot (synthetic occlusions), ZJU-MoCap (central-block masking; 100 frames for training, 22 views for testing), and OcMotion (real-world, persistent occlusions).
| Method | ZJU-MoCap PSNR (dB) | OcMotion PSNR (dB) |
|---|---|---|
| HumanNeRF | 20.67 | 9.79 |
| OccNeRF | 22.40 | 15.71 |
| OccFusion (SDS) | 23.96 | 18.28 |
| InpaintHuman | 24.65 | 19.02 |

Higher PSNR is better.
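For reference, PSNR is computed as $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$; a minimal helper, assuming images scaled to $[0, 1]$:

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))
```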
Ablation studies on PeopleSnapshot demonstrated progressive improvements as components were added:
- Base (no multi-scale, no textual inversion, no semantic guidance): 20.05 dB PSNR
- + Multi-Scale (MS): 22.35 dB
- + Textual Inversion (TI): 24.27 dB
- + Semantic Guidance (SG): 24.31 dB
Qualitatively, the method yields smoother, more detailed reconstructions than OccNeRF (which suffers from “blotchy” hole artifacts) and OccFusion (which hallucinates colors and fails in clothing texture fidelity) (Fan et al., 5 Jan 2026).
5. Comparison to Related Inpainting Approaches
Earlier facial and human inpainting techniques have leveraged:
- Exemplar- and attribute-guided GANs for facial completion (e.g., EXE-GAN (Lu et al., 2022), Reference-Guided (Yoon et al., 2023)), which focus mainly on 2D completion or attribute transfer for faces rather than general 3D reconstruction or animation.
- Approaches such as EXE-GAN (Lu et al., 2022) employ mixed latent code style modulation, spatially-variant gradient weighting, and adversarial objectives to preserve both perceptual qualities and exemplar-driven attributes, but do not directly produce 3D avatars.
- Foreground-guided methods optimize fidelity via region-specific loss (e.g., (Jam et al., 2021)), but lack the latent-space synthesis and semantic regularization for unseen body parts required in general 3D occlusion scenarios.
Distinctively, InpaintHuman bridges learned 2D UV feature maps, geometric canonicalization, and personalized latent diffusion to enable reconstruction and animation of humans from incomplete monocular observations.
6. Contributions, Limitations, and Future Directions
InpaintHuman’s primary technical contributions are:
- Integration of multi-scale canonical UV encoding to balance occlusion robustness with fine geometry preservation.
- Personalized, identity-preserving latent diffusion inpainting with explicit ControlNet-style body semantics.
- An end-to-end staged optimization pipeline capable of reconstructing fully animatable, subject-faithful human avatars from occluded input video.
A limitation is reliance on accurate SMPL parameter estimates and subject-specific textual inversion, which may require adaptation for diverse ethnicities, clothing styles, or generalized motion beyond the original training domain. Expansion to more general subject types or in-the-wild scenarios with extreme, persistent occlusions may require extensions in semantic control and adaptive canonical parameterization.
7. Summary and Outlook
InpaintHuman combines coarse-to-fine UV-based geometry, subject-personalized diffusion inpainting, and semantic guidance to address occlusion-robust 3D human reconstruction from monocular video. It improves on prior approaches both quantitatively and qualitatively, providing a platform for future developments in high-fidelity, identity-preserving human digitization and animation from limited or imperfect real-world observations (Fan et al., 5 Jan 2026).