
InpaintHuman: Animatable 3D Avatars from Video

Updated 12 January 2026
  • InpaintHuman constructs animatable 3D human avatars from heavily occluded monocular video while preserving identity and realistic geometry.
  • It employs a multi-scale UV-parameterized 3D Gaussian framework coupled with personalized diffusion-based inpainting for accurate reconstructions.
  • Performance exceeds state-of-the-art approaches on both synthetic and real-world benchmarks, with higher PSNR and visual fidelity.

InpaintHuman is a two-stage, analysis-by-synthesis framework for reconstructing complete, animatable 3D human avatars from heavily occluded monocular video using a combination of coarse-to-fine UV-parameterized 3D Gaussian splatting and personalized, semantic-guided diffusion-based inpainting. It addresses challenges in faithful geometry recovery and identity preservation under conditions of severe and persistent occlusion, surpassing previous state-of-the-art interpolation- and supervision-guided approaches in both synthetic and real-world benchmarks (Fan et al., 5 Jan 2026).

1. Multi-Scale UV-Parameterized Canonical Representation

InpaintHuman adopts a canonical human body parameterization based on the SMPL mesh model. Every 3D surface point $x \in \mathbb{R}^3$ is associated with UV coordinates $(u, v) \in [0, 1]^2$ via the SMPL template, effectively mapping the human surface to a 2D manifold. Over this manifold, a set of $L$ multi-scale feature maps $\mathcal{F}_1, \ldots, \mathcal{F}_L$ at increasing resolutions (e.g., 64², 128², 256²) is defined.

During canonical representation construction:

  • Each Gaussian sample at $x_i$ with UV coordinates $(u_i, v_i)$ receives a bilinearly interpolated feature $f_l = \mathcal{F}_l(u_i, v_i)$ per scale $l$.
  • Canonical features are aggregated as $f_c = \sum_{l=1}^{L} f_l$ or by a learned blending weight:

$$F(u, v) = \alpha(u, v)\, F_c(u, v) + (1 - \alpha(u, v))\, F_f(u, v)$$

where $\alpha(u, v)$ is a learned per-texel function and $F_c$, $F_f$ are coarse- and fine-level features.

Optionally, a pose-dependent feature $f_t$ is added using a ControlNet-style encoder of the posed SMPL mesh. The combined feature $f = f_c + f_t$ is decoded via an MLP $\mathcal{D}$ to predict canonical Gaussian attributes: offset $\Delta\mu_i$, scale $s_i$, and color $c_i$. This multi-scale, hierarchical interpolation enables coarse features to propagate context across large occlusions, while fine scales preserve detailed geometry.
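
A minimal PyTorch sketch of this UV lookup and decoding step is given below, using only two scales (coarse and fine) for brevity; the class name `UVGaussianDecoder`, the channel/width choices, and the attribute-head layout are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UVGaussianDecoder(nn.Module):
    """Sketch of multi-scale UV feature maps + MLP decoder for canonical Gaussians.

    Hypothetical configuration: resolutions, channel counts, and head layout are
    illustrative, not the paper's exact settings.
    """

    def __init__(self, channels=32, resolutions=(64, 256)):
        super().__init__()
        # Coarse (F_c) and fine (F_f) learnable feature maps over the SMPL UV domain.
        self.feature_maps = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, channels, r, r)) for r in resolutions]
        )
        # Per-texel logits for the learned blending weight alpha(u, v).
        self.alpha_logits = nn.Parameter(torch.zeros(1, 1, resolutions[0], resolutions[0]))
        # MLP decoder D: feature -> (offset Δμ, scale s, color c) = 3 + 3 + 3 values.
        self.mlp = nn.Sequential(
            nn.Linear(channels, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 9),
        )

    def sample_uv(self, fmap, uv):
        # uv in [0, 1]^2, shape (N, 2); grid_sample expects coordinates in [-1, 1].
        grid = (uv * 2.0 - 1.0).view(1, -1, 1, 2)
        feat = F.grid_sample(fmap, grid, mode="bilinear", align_corners=True)
        return feat.view(fmap.shape[1], -1).t()  # (N, C)

    def forward(self, uv, pose_feat=None):
        f_coarse = self.sample_uv(self.feature_maps[0], uv)
        f_fine = self.sample_uv(self.feature_maps[-1], uv)
        alpha = torch.sigmoid(self.sample_uv(self.alpha_logits, uv))  # (N, 1)
        # Learned-blend aggregation F = alpha * F_c + (1 - alpha) * F_f;
        # a plain sum over all scales is the simpler alternative mentioned above.
        f_c = alpha * f_coarse + (1.0 - alpha) * f_fine
        # Optional pose-dependent feature f_t (e.g., from a ControlNet-style encoder).
        f = f_c if pose_feat is None else f_c + pose_feat
        out = self.mlp(f)
        d_mu, scale, color = out[:, :3], out[:, 3:6], out[:, 6:9]
        return d_mu, torch.exp(scale), torch.sigmoid(color)


# Usage: decode attributes for 10k canonical Gaussians with known UV coordinates.
decoder = UVGaussianDecoder()
uv = torch.rand(10_000, 2)
d_mu, scale, color = decoder(uv)
```

Here `grid_sample` performs the bilinear interpolation $\mathcal{F}_l(u_i, v_i)$ described above, so any observed pixel can update the shared UV texels that later decode occluded Gaussians.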

2. Personalized Diffusion-Based Inpainting Module

While UV interpolation propagates features across observed regions, it cannot hallucinate unseen areas (e.g., consistently occluded body parts). InpaintHuman introduces a diffusion-based inpainting module, personalized to the subject, to synthesize missing textures and geometry while tightly preserving subject identity and plausible semantics.

Key components:

  • Textual Inversion: A unique token $V^*$ is inserted into the text prompt vocabulary, with an embedding $v^*$ learned such that diffusion denoising, conditioned on $v^*$, reconstructs visible subject images. The loss is:

$$\mathcal{L}_{TI} = \mathbb{E}_{z_t, \epsilon, t}\left\| \epsilon - \epsilon_\phi\big(z_t, t, \tau_\psi(V^*)\big) \right\|_2^2$$

with $z_t$ the noised latent, $\epsilon_\phi$ the U-Net denoiser, and $\tau_\psi$ the text encoder.

  • Semantic-Conditioned ControlNet Guidance: For each frame, a semantic part-label map $\mathcal{S}_t$ is rendered from SMPL at pose $\theta_t$. A ControlNet branch $C(\mathcal{S}_t)$ injects these priors into the diffusion model to enforce spatial coherence with the underlying body layout.
  • Joint Inpainting Loss: Self-supervised training randomly masks pixels in visible crops, requiring the system to recover them using a sum of standard diffusion, pixel reconstruction, and optional identity-matching losses:

$$\mathcal{L}_{\text{inpaint}} = \lambda_{\text{diff}}\,\mathcal{L}_{\text{diff}} + \lambda_{\text{pix}}\,\mathcal{L}_{\text{pix}} + \lambda_{\text{id}}\,\mathcal{L}_{\text{id}}$$

with $\mathcal{L}_{\text{diff}}$ the denoising-plus-semantic loss, $\mathcal{L}_{\text{pix}}$ a pixel-level $L_1$ loss, and (optionally) $\mathcal{L}_{\text{id}}$ a feature-matching identity constraint (e.g., via a face recognition network).

This approach curbs “identity drift” and ensures that the inpainted regions are temporally consistent and subject-specific.
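
A minimal sketch of how these three terms might be combined in a single training step is shown below; `unet`, `text_encoder`, `vae_encode`, `vae_decode`, and `id_embed` are hypothetical stand-ins for the latent-diffusion backbone, text encoder, VAE, and identity network, and the noise schedule and loss weights are illustrative rather than the paper's settings.

```python
import torch
import torch.nn.functional as F


def inpainting_loss(unet, text_encoder, vae_encode, vae_decode, id_embed,
                    image, mask, semantic_map, token_ids,
                    lam_diff=1.0, lam_pix=0.1, lam_id=0.05, num_timesteps=1000):
    """Joint objective L_inpaint = λ_diff L_diff + λ_pix L_pix + λ_id L_id.

    All callables are hypothetical stand-ins; weights and the noise schedule
    are illustrative, not the paper's configuration.
    """
    # Forward diffusion: noise the latent of the visible crop at a random timestep
    # (same form as the textual-inversion loss L_TI above).
    z0 = vae_encode(image)
    t = torch.randint(0, num_timesteps, (z0.shape[0],), device=z0.device)
    # Illustrative cosine noise schedule for alpha_bar_t.
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_timesteps).view(-1, 1, 1, 1) ** 2
    eps = torch.randn_like(z0)
    z_t = alpha_bar.sqrt() * z0 + (1.0 - alpha_bar).sqrt() * eps

    # Text conditioning includes the learned personal token V*; the SMPL part-label
    # map acts as the ControlNet-style spatial condition.
    text_emb = text_encoder(token_ids)
    eps_pred = unet(z_t, t, text_emb, control=semantic_map)

    # 1) Denoising + semantic loss.
    loss_diff = F.mse_loss(eps_pred, eps)

    # 2) Pixel-level L1 on the randomly masked region, using the predicted clean image.
    z0_pred = (z_t - (1.0 - alpha_bar).sqrt() * eps_pred) / alpha_bar.sqrt().clamp(min=1e-4)
    x0_pred = vae_decode(z0_pred)
    loss_pix = (mask * (x0_pred - image).abs()).sum() / mask.sum().clamp(min=1.0)

    # 3) Optional identity constraint via a frozen face/person embedding network.
    loss_id = 1.0 - F.cosine_similarity(id_embed(x0_pred), id_embed(image), dim=-1).mean()

    return lam_diff * loss_diff + lam_pix * loss_pix + lam_id * loss_id
```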

3. End-to-End Algorithmic Pipeline

The InpaintHuman pipeline consists of three major stages:

initialize UV feature maps F_l, decoder D
repeat until convergence:
    for each frame i:
        render Gaussians with current F_l, D -> I_hat_i
        compute L_init = sum_{p in M_vis^i} |I_hat_i(p) - I_i(p)|
    update F_l, D via gradient of L_init

collect visible crops I_i^{vis}, visibility masks M_vis^i
learn text-token V* via L_TI
fine-tune inpainting LoRA parameters to minimize L_inpaint

for each frame i:
    run personalized inpainting -> I_hat_i^{full}, full mask M_full^i
    render UV->I_hat_i^{ref}
    compute L_refine = sum_{p in M_full^i} |I_hat_i^{ref}(p) - I_hat_i^{full}(p)| + lambda_ssim * SSIM + lambda_lpips * LPIPS
    update F_l, D via gradient of L_refine
(Editor's note: this is an abbreviated form of the paper’s pseudocode; see (Fan et al., 5 Jan 2026) for the full specification.)
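
As a concrete illustration of the Stage-3 refinement objective L_refine above, the following sketch combines a masked L1 term with SSIM and LPIPS penalties; `ssim_fn` and `lpips_fn` are assumed to be externally supplied metric callables, and the loss weights are placeholders.

```python
import torch


def refine_loss(render, inpainted, mask, ssim_fn, lpips_fn,
                lam_ssim=0.2, lam_lpips=0.1):
    """Stage-3 refinement: masked L1 + SSIM + LPIPS between the Gaussian render
    and the personalized-inpainted target. Weights are illustrative."""
    # Masked L1 over the full inpainted region M_full.
    l1 = (mask * (render - inpainted).abs()).sum() / mask.sum().clamp(min=1.0)
    # Structural / perceptual terms on the masked images so gradients only flow
    # through the region covered by the mask.
    r_m, t_m = render * mask, inpainted * mask
    loss_ssim = 1.0 - ssim_fn(r_m, t_m)     # SSIM is a similarity; 1 - SSIM as a loss
    loss_lpips = lpips_fn(r_m, t_m).mean()  # LPIPS is already a distance
    return l1 + lam_ssim * loss_ssim + lam_lpips * loss_lpips
```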

4. Experimental Evaluation and Benchmarks

InpaintHuman was evaluated on PeopleSnapshot (synthetic occlusions), ZJU-MoCap (central-block masking; 100 frames for training, 22 views for testing), and OcMotion (real-world, persistent occlusions).

| Method | ZJU-MoCap PSNR | OcMotion PSNR | ZJU-MoCap SSIM | OcMotion SSIM | ZJU-MoCap LPIPS* | OcMotion LPIPS* |
|---|---|---|---|---|---|---|
| HumanNeRF | 20.67 | 9.79 | - | - | - | - |
| OccNeRF | 22.40 | 15.71 | - | - | - | - |
| OccFusion (SDS) | 23.96 | 18.28 | - | - | - | - |
| InpaintHuman | 24.65 | 19.02 | - | - | - | - |

LPIPS*: reported as 1000 × LPIPS (lower is better).

Ablation studies (PeopleSnapshot) demonstrated progressive improvements:

  • Base (no multi-scale, no textual inversion, no semantic guidance): PSNR = 20.05 dB
  • + Multi-Scale (MS): 22.35 dB
  • + Textual Inversion (TI): 24.27 dB
  • + Semantic Guidance (SG): 24.31 dB

Qualitatively, the method yields smoother, more detailed reconstructions than OccNeRF (which suffers from “blotchy” hole artifacts) and OccFusion (which hallucinates colors and fails in clothing texture fidelity) (Fan et al., 5 Jan 2026).

5. Relation to Prior Inpainting Approaches

Earlier facial and human inpainting techniques have leveraged:

  • Exemplar- and attribute-guided GANs for facial completion (e.g., EXE-GAN (Lu et al., 2022), Reference-Guided (Yoon et al., 2023)), focusing mainly on 2D or attribute transfer for faces, not general 3D reconstructability or animation.
  • Approaches such as EXE-GAN (Lu et al., 2022) employ mixed latent code style modulation, spatially-variant gradient weighting, and adversarial objectives to preserve both perceptual qualities and exemplar-driven attributes, but do not directly produce 3D avatars.
  • Foreground-guided methods optimize fidelity via region-specific loss (e.g., (Jam et al., 2021)), but lack the latent-space synthesis and semantic regularization for unseen body parts required in general 3D occlusion scenarios.

Distinctively, InpaintHuman bridges learned 2D UV feature maps, geometric canonicalization, and personalized latent diffusion to enable reconstruction and animation of humans from incomplete monocular observations.

6. Contributions, Limitations, and Future Directions

InpaintHuman’s primary technical contributions are:

  • Integration of multi-scale canonical UV encoding to balance occlusion robustness with fine geometry preservation.
  • Personalized, identity-preserving latent diffusion inpainting with explicit ControlNet-style body semantics.
  • An end-to-end staged optimization pipeline capable of reconstructing fully animatable, subject-faithful human avatars from occluded input video.

A limitation is reliance on accurate SMPL parameter estimates and subject-specific textual inversion, which may require adaptation for diverse ethnicities, clothing styles, or generalized motion beyond the original training domain. Expansion to more general subject types or in-the-wild scenarios with extreme, persistent occlusions may require extensions in semantic control and adaptive canonical parameterization.

7. Summary and Outlook

InpaintHuman integrates coarse-to-fine UV-based geometry, subject-personalized diffusion inpainting, and semantic guidance mechanisms to address occlusion-robust 3D human reconstruction from monocular video. It advances both quantitative and qualitative metrics over prior approaches, providing a platform for future developments in high-fidelity, identity-preserving human digitization and animation from limited or imperfect real-world observations (Fan et al., 5 Jan 2026).
