
SMPLitex: 3D Human Texture Diffusion

Updated 10 March 2026
  • SMPLitex is a generative diffusion framework that estimates and edits complete 3D human textures from a single image using SMPL and DensePose mapping.
  • It employs a streamlined, one-shot inpainting pipeline, leveraging latent diffusion and a curated dataset to support text-driven and structural UV texture manipulation.
  • Reported experiments show better SSIM and LPIPS scores than prior state-of-the-art methods on Market-1501 and THUman2.0.

SMPLitex is a generative diffusion-based framework and dataset for complete 3D human texture estimation and manipulation from a single image. It integrates recent advancements in latent diffusion modeling and 3D body modeling via SMPL, enabling high-fidelity synthesis, editing, and inpainting of UV-mapped textures directly associated with the SMPL mesh topology. SMPLitex introduces a methodologically streamlined pipeline and a curated dataset of high-quality 3D human textures, with evaluations demonstrating substantial improvement over prior state-of-the-art in both cross-view reconstruction accuracy and support for diverse text-driven and structural editing tasks (Casas et al., 2023).

1. Model Architecture and Integration with SMPL

The SMPLitex system adopts a latent diffusion model (LDM) backbone as described by Rombach et al. (2022), omitting adversarial components such as GAN discriminators and relying entirely on the diffusion paradigm for both training and inference. Within this architecture, the encoder $\mathcal{E}$ of a frozen variational autoencoder projects 2D (or UV) images into a spatial latent variable $z \in \mathbb{R}^{h \times w \times c}$. While exact dimensions are not stated, the reference to Stable Diffusion v1-4 implies that $512 \times 512$ images are downsampled to a $64 \times 64 \times 4$ latent representation.
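
The implied latent size can be checked with a few lines of arithmetic. The sketch below assumes the Stable Diffusion v1 values of an 8x spatial downsampling factor and 4 latent channels; neither value is stated for SMPLitex itself, and the function name is illustrative.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Return (h, w, c) of the latent for an image of the given size.

    downsample=8 and channels=4 are the Stable Diffusion v1 values,
    assumed here; the source does not state them for SMPLitex.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // downsample, width // downsample, channels)


print(latent_shape(512, 512))  # (64, 64, 4)
```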

The generative process is orchestrated by a time-conditional U-Net $\epsilon_\theta$ operating over $(z_t, c, t)$, where $z_t$ is the noisy latent, $c$ is the context encoding (textual description and/or partial UV observation), and $t$ is the diffusion timestep.

SMPL integration is achieved by leveraging the SMPL parametric body model $M(\theta, \beta)$ (pose and shape), combined with DensePose correspondences $d(p)$ for pixel-to-UV mapping and a silhouette mask $s(p)$ to extract visible subject regions. The partial UV map $u_{\text{part}}$ is then assembled by projecting masked image pixels into UV space:

$$u_{\text{part}} = \Pi(x, d \odot s)$$

where $\odot$ denotes the element-wise product and $\Pi$ distributes image colors into the corresponding UV bins.
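
A minimal NumPy sketch of this projection is given below. It assumes the correspondences are already available as per-pixel UV coordinates in [0, 1] within a single SMPL texture atlas, that colliding pixels are averaged, and that texture rows grow with v; none of these details is confirmed by the source, and `partial_uv_map` is a hypothetical name.

```python
import numpy as np


def partial_uv_map(image, uv, mask, size=512):
    """Scatter visible image pixels into a size x size UV texture.

    image: (H, W, 3) float array of colors x.
    uv:    (H, W, 2) float array of per-pixel UV coordinates in [0, 1]
           (the correspondences d, assumed already in SMPL atlas space).
    mask:  (H, W) boolean silhouette s; only True pixels are projected.
    Returns the partial texture and a boolean map of filled texels.
    """
    texture = np.zeros((size, size, 3), dtype=np.float64)
    counts = np.zeros((size, size), dtype=np.float64)

    ys, xs = np.nonzero(mask)
    cols = np.clip(np.round(uv[ys, xs, 0] * (size - 1)).astype(int), 0, size - 1)
    rows = np.clip(np.round(uv[ys, xs, 1] * (size - 1)).astype(int), 0, size - 1)

    # Accumulate colors per texel, then average where several pixels collide.
    np.add.at(texture, (rows, cols), image[ys, xs])
    np.add.at(counts, (rows, cols), 1.0)
    filled = counts > 0
    texture[filled] /= counts[filled][:, None]
    return texture, filled
```

Texels where `filled` is False are the ones the diffusion model has to inpaint.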

The diffusion process is conditioned on $u_{\text{part}}$ (either by concatenation or through cross-attention), which directs the inpainting of missing texels to complete the full SMPL UV texture.

2. Training Objectives and Optimization

SMPLitex optimization follows the standard LDM denoising loss, specifically the $L_2$ noise prediction objective:

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{z_t \sim q(z_t \mid z_0),\, t,\, c,\, \epsilon \sim \mathcal{N}(0,1)} \left[ \|\epsilon - \epsilon_\theta(z_t, c, t)\|_2^2 \right]$$
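
For a single sample, this objective can be sketched as follows. The forward process is written in the usual DDPM form with a cumulative schedule value `alpha_bar_t`, which is an assumption because the source does not define $q$; the function names are illustrative, and practical implementations typically average over elements rather than sum.

```python
import numpy as np


def add_noise(z0, eps, alpha_bar_t):
    """Forward process q(z_t | z_0): mix the clean latent with Gaussian noise."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps


def noise_prediction_loss(eps, eps_pred):
    """Squared L2 error between true and predicted noise for one sample."""
    return float(np.sum((eps - eps_pred) ** 2))


rng = np.random.default_rng(0)
z0 = rng.standard_normal((64, 64, 4))
eps = rng.standard_normal(z0.shape)
z_t = add_noise(z0, eps, alpha_bar_t=0.5)

# A perfect predictor recovers eps exactly, so the loss is zero.
print(noise_prediction_loss(eps, eps))  # 0.0
```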

No adversarial, explicit $L_1/L_2$ reconstruction, perceptual (VGG), or additional regularization losses are reported beyond this denoising term. To transfer a general image LDM to UV-mapped textures, the authors supplement training with the "prior-preservation" loss of DreamBooth (Ruiz et al., 2023), which mitigates catastrophic forgetting; its formula is not given explicitly.
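
For orientation only, the prior-preservation idea from DreamBooth, in its commonly implemented noise-prediction form, adds a second denoising term computed on samples generated by the frozen pretrained model. The symbols $z_0^{\text{pr}}$, $c^{\text{pr}}$, and $\lambda$ below are assumptions introduced for this sketch; the source does not confirm that SMPLitex uses exactly this form or what weight it applies.

$$\mathcal{L} = \mathcal{L}_{\text{diff}}(z_0, c) + \lambda\, \mathcal{L}_{\text{diff}}(z_0^{\text{pr}}, c^{\text{pr}})$$

Here $z_0$ and $c$ are the latents and prompts of the fine-tuning textures, $z_0^{\text{pr}}$ and $c^{\text{pr}}$ are latents and generic prompts of images sampled from the original model before fine-tuning, and $\lambda$ weights how strongly the original prior is retained.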

3. Dataset Composition and Sampling Methodology

The fine-tuning phase uses a dataset of 10 high-quality UV texture maps from earlier SMPL reconstruction studies (Alldieck et al. 2018; Lazova et al. 2019). These serve as targets to steer the pretrained latent diffusion backbone toward learning SMPL-style UV parametrizations.

The resulting SMPLitex dataset comprises 100 curated and diversified human UV textures, each generated via classifier-free diffusion sampling (guidance scale 2.0, 50 denoising steps) and paired with textual prompts describing specific clothing, accessory, and identity attributes. All textures are standardized to $512 \times 512$ UV resolution. No explicit closed-form data distribution is specified, implying sampling is performed in a prompt-driven, controlled-but-diverse fashion.
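
Classifier-free guidance combines a conditional and an unconditional noise prediction at each denoising step. The sketch below uses the standard guidance formula with the scale of 2.0 quoted above; the function name and the toy arrays are illustrative, and the source does not describe how SMPLitex implements this step.

```python
import numpy as np


def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Blend unconditional and conditional noise predictions.

    scale = 0 ignores the prompt, scale = 1 is plain conditional sampling,
    and scale > 1 pushes the prediction further toward the prompt.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)


eps_uncond = np.zeros((64, 64, 4))
eps_cond = np.ones((64, 64, 4))
guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)
print(guided[0, 0, 0])  # 2.0
```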

4. One-Shot 3D Texture Fitting Pipeline

Given a single input image $x$, the pipeline proceeds as follows:

  1. Detect the person in 2D and estimate the SMPL pose $\theta$ and shape $\beta$.
  2. Compute DensePose correspondences $d(p)$ for visible pixels and generate a binary silhouette mask $s(p)$.
  3. Assemble the partial UV map: $u_{\text{part}} = \Pi(x, d \odot s)$.
  4. Condition the LDM inpainting model on $u_{\text{part}}$ to sample completions for missing or occluded texels:
     • Sample $z_T \sim \mathcal{N}(0, I)$.
     • Iteratively denoise:

       $$z_{t-1} \sim p_\theta(z_{t-1} \mid z_t,\, u_{\text{part}},\, t)$$

     • Decode the final latent with the VAE decoder $D$:

       $$u_{\text{full}} = D(z_0)$$

This procedure is inference-only: it runs the LDM sampler to inpaint the texture, with no per-image optimization or backpropagation through the inputs.
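
The sampling stage of this pipeline can be sketched with a deterministic DDIM-style update. The sampler type, the noise schedule, and the `eps_model` callable are all assumptions made for illustration; the source does not state which sampler SMPLitex uses, and the oracle below merely stands in for the fine-tuned U-Net so that the loop can be checked.

```python
import numpy as np


def sample_texture_latent(eps_model, u_part, alpha_bars, shape, rng):
    """Deterministic DDIM-style sampling conditioned on u_part.

    eps_model(z_t, u_part, t) -> predicted noise; a stand-in for the
    fine-tuned U-Net. alpha_bars[t] is a cumulative noise schedule that
    decreases with t. Both are hypothetical inputs for this sketch.
    """
    z = rng.standard_normal(shape)  # z_T ~ N(0, I)
    for t in range(len(alpha_bars) - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = eps_model(z, u_part, t)
        # Estimate the clean latent, then step to the previous noise level.
        z0_hat = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps
    return z  # z_0, to be passed to the VAE decoder D


# Toy check with an oracle noise predictor that knows the target latent.
alpha_bars = np.linspace(0.99, 0.01, 50)  # 50 steps, hypothetical schedule
target = np.full((8, 8, 4), 0.5)


def oracle(z_t, u_part, t):
    a_t = alpha_bars[t]
    return (z_t - np.sqrt(a_t) * target) / np.sqrt(1.0 - a_t)


z0 = sample_texture_latent(oracle, None, alpha_bars, target.shape,
                           np.random.default_rng(0))
print(np.allclose(z0, target))  # True
```

No gradients are computed anywhere in the loop, which is the sense in which fitting requires no optimization at inference.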

5. Experimental Results and Quantitative Performance

SMPLitex’s performance is evaluated across three benchmarks: Market-1501 (cross-view re-rendering, $64 \times 128$ images), THUman2.0 (multi-view render-from-scan), and DeepFashion-MultiModal (qualitative assessment on $750 \times 1101$ images).

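| Dataset | Metric | TexFormer | CMR | HPBTT | RSTG | TexGlo | SMPLitex |
|---|---|---|---|---|---|---|---|
| Market-1501 | SSIM (↑) | 0.7422 | 0.7142 | 0.7420 | 0.6735 | 0.6658 | 0.8648 (+0.12) |
| Market-1501 | LPIPS (↓) | 0.1154 | 0.1275 | 0.1168 | 0.1778 | 0.1776 | 0.0695 (-0.0459) |
| THUman2.0 | SSIM (↑) | 0.8761 | - | - | - | - | 0.8829 (+0.0068) |
| THUman2.0 | LPIPS (↓) | 0.1223 | - | - | - | - | 0.1067 (-0.0156) |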

On Market-1501, SMPLitex exceeds the strongest baseline (TexFormer) by about 0.12 in SSIM and lowers the LPIPS perceptual error by 0.0459. On THUman2.0, it improves on TexFormer by 0.0068 in SSIM and 0.0156 in LPIPS. The DeepFashion-MultiModal evaluation highlights superior recovery of high-frequency details (e.g., garment wrinkles, facial anisotropy) not matched by GAN-based methods.
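
The margins can be reproduced from the reported scores with a short check. The dictionary layout and variable names are illustrative; the numbers are the ones listed above.

```python
# Market-1501 baseline scores as reported above.
market_ssim = {"TexFormer": 0.7422, "CMR": 0.7142, "HPBTT": 0.7420,
               "RSTG": 0.6735, "TexGlo": 0.6658}
market_lpips = {"TexFormer": 0.1154, "CMR": 0.1275, "HPBTT": 0.1168,
                "RSTG": 0.1778, "TexGlo": 0.1776}

# SSIM is higher-is-better, LPIPS is lower-is-better.
best_ssim_name = max(market_ssim, key=market_ssim.get)
best_lpips_name = min(market_lpips, key=market_lpips.get)

ssim_gain = round(0.8648 - market_ssim[best_ssim_name], 4)
lpips_drop = round(market_lpips[best_lpips_name] - 0.0695, 4)

# THUman2.0 margins relative to TexFormer.
thuman_ssim_gain = round(0.8829 - 0.8761, 4)
thuman_lpips_drop = round(0.1223 - 0.1067, 4)

print(best_ssim_name, ssim_gain)    # TexFormer 0.1226
print(best_lpips_name, lpips_drop)  # TexFormer 0.0459
print(thuman_ssim_gain, thuman_lpips_drop)  # 0.0068 0.0156
```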

Qualitative examination reveals artifact-free novel view synthesis, plausible completions in occluded regions, and sharp attribute transfer across diverse poses and body shapes.

6. Applications and Capabilities

SMPLitex enables a range of editing and synthesis tasks:

  • Partial-mask inpainting: Arbitrary UV regions in $u_{\text{part}}$ can be replaced and inpainted, supporting localized editing such as color changes or logo addition.
  • Novel-view synthesis: Completed $u_{\text{full}}$ textures mapped onto $M(\theta, \beta)$ are rendered from arbitrary viewpoints, consistently across pose sequences due to the SMPL UV parameterization.
  • Text-driven attribute manipulation: The LDM backbone allows sampling conditioned on text prompts, supporting synthesis of textures with user-defined clothing, appearance, or accessories. Interpolation in CLIP-embedding or diffusion latent space facilitates smooth morphing between attribute sets.
  • Latent-space editing: For prompt embeddings $e_1$, $e_2$ (text prompts $t_1$, $t_2$), forming $e_\alpha = (1-\alpha)e_1 + \alpha e_2$ enables synthesis of intermediate textures with interpolated attributes.
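
The prompt-embedding interpolation in the last item can be sketched as a linear blend. The embedding shape of (77, 768) matches the CLIP text encoder used by Stable Diffusion v1 and is assumed here; the random arrays stand in for real prompt embeddings, and the function name is illustrative.

```python
import numpy as np


def interpolate_embeddings(e1, e2, alpha):
    """Return (1 - alpha) * e1 + alpha * e2 for alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * e1 + alpha * e2


rng = np.random.default_rng(0)
e1 = rng.standard_normal((77, 768))  # placeholder embedding of prompt t1
e2 = rng.standard_normal((77, 768))  # placeholder embedding of prompt t2

# Five evenly spaced blends from e1 (alpha = 0) to e2 (alpha = 1).
blends = [interpolate_embeddings(e1, e2, a) for a in np.linspace(0.0, 1.0, 5)]
print(len(blends), blends[0].shape)  # 5 (77, 768)
```

Each blended embedding would then be passed to the sampler as the context $c$ to produce one intermediate texture.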

7. Contributions and Release

SMPLitex’s principal contributions are:

  • A fine-tuned latent diffusion backbone that natively generates high-fidelity, fully-differentiable SMPL UV textures.
  • A one-shot single-image fitting pipeline requiring only DensePose correspondences and silhouette masking, allowing high-quality texture completion for previously unseen subjects.
  • Public release of both 100 curated SMPL-mapped textures (standardized to $512 \times 512$ UV format) and the trained diffusion model, thus providing reusable assets for downstream research in human texture synthesis, text-driven editing, and 3D avatar creation (Casas et al., 2023).