SMPLitex: 3D Human Texture Diffusion
- SMPLitex is a generative diffusion framework that estimates and edits complete 3D human textures from a single image using SMPL and DensePose mapping.
- It employs a streamlined, one-shot inpainting pipeline, leveraging latent diffusion and a curated dataset to support text-driven and structural UV texture manipulation.
- Experimental results show better SSIM and LPIPS scores than state-of-the-art methods, indicating high-fidelity texture synthesis.
SMPLitex is a generative diffusion-based framework and dataset for complete 3D human texture estimation and manipulation from a single image. It integrates recent advancements in latent diffusion modeling and 3D body modeling via SMPL, enabling high-fidelity synthesis, editing, and inpainting of UV-mapped textures directly associated with the SMPL mesh topology. SMPLitex introduces a methodologically streamlined pipeline and a curated dataset of high-quality 3D human textures, with evaluations demonstrating substantial improvement over prior state-of-the-art in both cross-view reconstruction accuracy and support for diverse text-driven and structural editing tasks (Casas et al., 2023).
1. Model Architecture and Integration with SMPL
The SMPLitex system adopts a latent diffusion model (LDM) backbone as described by Rombach et al. (2022), omitting adversarial components such as GAN discriminators and relying entirely on the diffusion paradigm for both training and inference. Within this architecture, a frozen variational autoencoder (VAE) encoder $\mathcal{E}$ projects 2D (or UV) images $x$ into a spatial latent variable $z = \mathcal{E}(x)$. While exact dimensions are not stated, the reference to Stable Diffusion v1-4 implies an 8× spatial downsampling, for example from a $512 \times 512$ image to a $64 \times 64 \times 4$ latent representation.
The generative process is orchestrated by a time-conditional U-Net $\epsilon_\theta(z_t, c, t)$, where $z_t$ is the noisy latent, $c$ is the context encoding (textual description and/or partial UV observation), and $t$ is the diffusion timestep.
SMPL integration is achieved by leveraging the SMPL parametric body model (pose and shape), combined with DensePose correspondences for pixel-to-UV mapping and a silhouette mask to extract visible subject regions. The partial UV map $T_{\text{partial}}$ is then assembled by projecting masked image pixels into UV space:
$$T_{\text{partial}} = \Pi_{\text{DP}}(I \odot M)$$
where $I$ is the input image, $M$ is the silhouette mask, $\odot$ denotes the element-wise product, and $\Pi_{\text{DP}}$ is the DensePose-derived mapping that distributes image colors into their corresponding UV bins.
The generative diffusion is guided by $T_{\text{partial}}$ as a conditioning signal (either concatenated with the latent input or injected through cross-attention), directing inpainting of missing texels to complete the full SMPL UV texture.
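The projection of masked pixels into UV space can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name `assemble_partial_uv`, the `uv_coords` input format, and the tiny `uv_size` are all assumptions. It also assumes each pixel already carries a single (u, v) coordinate in the SMPL UV atlas, whereas real DensePose output is per body part and needs an extra conversion step.

```python
import numpy as np

def assemble_partial_uv(image, mask, uv_coords, uv_size=64):
    """Scatter visible image pixels into a UV texture (illustrative sketch).

    image:     (H, W, 3) float array of pixel colors.
    mask:      (H, W) boolean silhouette mask.
    uv_coords: (H, W, 2) per-pixel (u, v) in [0, 1]; a hypothetical stand-in
               for DensePose correspondences already mapped to the SMPL atlas.
    Returns the partial texture and a boolean map of the texels that were hit.
    """
    texture = np.zeros((uv_size, uv_size, 3), dtype=np.float64)
    counts = np.zeros((uv_size, uv_size), dtype=np.int64)

    ys, xs = np.nonzero(mask)                      # visible pixels only
    u = uv_coords[ys, xs, 0]
    v = uv_coords[ys, xs, 1]
    cols = np.clip((u * uv_size).astype(int), 0, uv_size - 1)
    rows = np.clip((v * uv_size).astype(int), 0, uv_size - 1)

    # np.add.at accumulates correctly when several pixels land in one texel.
    np.add.at(texture, (rows, cols), image[ys, xs])
    np.add.at(counts, (rows, cols), 1)

    filled = counts > 0
    texture[filled] /= counts[filled][:, None]     # average colliding pixels
    return texture, filled
```

Texels where `filled` is false are the ones the diffusion model would need to inpaint.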
2. Training Objectives and Optimization
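SMPLitex optimization follows the standard LDM denoising loss, specifically the noise prediction objective:
$$\mathcal{L}_{\text{LDM}} = \mathbb{E}_{z,\, c,\, \epsilon \sim \mathcal{N}(0, \mathbf{I}),\, t}\left[\lVert \epsilon - \epsilon_\theta(z_t, c, t) \rVert_2^2\right]$$
where $z_t$ is the latent $z$ noised to timestep $t$, $c$ is the conditioning signal, and $\epsilon_\theta$ is the denoising U-Net.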
No adversarial, explicit reconstruction, perceptual (VGG), or additional regularization losses are reported beyond this denoising term. For transfer from a general image LDM to UV-mapped textures, the authors supplement training with the “prior-preservation” loss as in DreamBooth (Ruiz et al. 2023), mitigating catastrophic forgetting, although its formula is not explicitly provided.
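A minimal NumPy sketch of the noise-prediction objective may help make the training signal concrete. The linear beta schedule, the function names, and the `eps_model(z_t, cond, t)` callable are assumptions made for illustration; the source does not state the schedule, and the DreamBooth prior-preservation term is omitted because its formula is not given.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, noise, t, alpha_bars):
    """Forward diffusion: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * noise."""
    a = alpha_bars[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * noise

def denoising_loss(eps_model, z0, cond, t, alpha_bars, rng):
    """Mean squared error between the true noise and the model's prediction."""
    noise = rng.standard_normal(z0.shape)
    z_t = add_noise(z0, noise, t, alpha_bars)
    pred = eps_model(z_t, cond, t)      # hypothetical U-Net stand-in
    return float(np.mean((noise - pred) ** 2))
```

In practice `t` would be drawn uniformly at random for each training example and the loss averaged over a batch.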
3. Dataset Composition and Sampling Methodology
The fine-tuning phase uses a dataset of 10 high-quality UV texture maps from earlier SMPL reconstruction studies (Alldieck et al. 2018; Lazova et al. 2019). These serve as targets to steer the pretrained latent diffusion backbone toward learning SMPL-style UV parametrizations.
The resulting SMPLitex dataset comprises 100 curated and diversified human UV textures, each generated via classifier-free diffusion sampling (guidance scale 2.0, 50 denoising steps) and paired with textual prompts describing specific clothing, accessory, and identity attributes. All textures are standardized to a common UV resolution. No explicit data distribution is specified; sampling appears to be prompt-driven, controlled but diverse.
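Classifier-free guidance combines a conditional and an unconditional noise prediction at every denoising step. The sketch below shows the standard combination rule with the guidance scale of 2.0 quoted above; the function name and the toy arrays are illustrative assumptions, not code from the SMPLitex release.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=2.0):
    """Push the prediction away from the unconditional one, toward the prompt.

    A scale of 1.0 returns the conditional prediction unchanged; larger
    values strengthen the influence of the conditioning signal.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with made-up noise predictions of latent shape (4, 8, 8).
rng = np.random.default_rng(0)
eps_u = rng.standard_normal((4, 8, 8))   # prediction with an empty prompt
eps_c = rng.standard_normal((4, 8, 8))   # prediction with the text prompt
eps_guided = classifier_free_guidance(eps_u, eps_c, guidance_scale=2.0)
```

A relatively low scale such as 2.0 keeps samples closer to the learned texture distribution than the higher values often used for general image synthesis.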
4. One-Shot 3D Texture Fitting Pipeline
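Given a single input image $I$, the pipeline proceeds as follows:
- Detect a 2D person and estimate SMPL pose $\theta$ and shape $\beta$.
- Compute DensePose correspondences for visible pixels and generate a binary silhouette mask $M$.
- Assemble the partial UV map: $T_{\text{partial}} = \Pi_{\text{DP}}(I \odot M)$, where $\Pi_{\text{DP}}$ scatters the masked pixels into UV space via the DensePose correspondences.
- Condition the LDM inpainting model on $T_{\text{partial}}$ to sample completions for missing or occluded texels:
  - Sample $z_T \sim \mathcal{N}(0, \mathbf{I})$.
  - Iteratively denoise for $t = T, \dots, 1$: $z_{t-1} = \mathrm{step}\big(z_t,\ \epsilon_\theta(z_t, T_{\text{partial}}, t),\ t\big)$, where $\mathrm{step}$ denotes the sampler's update rule.
  - Decode the final latent with the VAE decoder $\mathcal{D}$: $T_{\text{full}} = \mathcal{D}(z_0)$.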
This procedure constitutes a pure feed-forward inpainting process using LDM sampling; no optimization or iterative backpropagation over inputs is employed at inference.
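The sampling loop can be sketched with a deterministic DDIM update. The source does not name the sampler, so DDIM is a swapped-in choice made only for illustration, as are the linear beta schedule, the function names, and the `oracle_eps` stand-in, which replaces the conditioned U-Net with a predictor that already knows the target latent. The VAE decoding step is left out.

```python
import numpy as np

def ddim_sample(eps_model, cond, shape, alpha_bars, num_steps=50, rng=None):
    """Deterministic DDIM sampling (eta = 0) in latent space; a sketch only."""
    rng = np.random.default_rng() if rng is None else rng
    # Evenly spaced timesteps, from most noisy to least noisy.
    timesteps = np.linspace(len(alpha_bars) - 1, 0, num_steps).round().astype(int)
    z = rng.standard_normal(shape)                 # z_T ~ N(0, I)
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        # After the final step no noise remains, hence alpha_bar = 1.
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = eps_model(z, cond, t)                # predicted noise
        z0_hat = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps
    return z  # a real pipeline would pass this latent to the VAE decoder

# Toy sanity check with an oracle noise predictor (hypothetical stand-in).
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
target = np.random.default_rng(0).standard_normal((4, 8, 8))

def oracle_eps(z_t, cond, t):
    a = alpha_bars[t]
    return (z_t - np.sqrt(a) * target) / np.sqrt(1.0 - a)

z0 = ddim_sample(oracle_eps, cond=None, shape=target.shape,
                 alpha_bars=alpha_bars, rng=np.random.default_rng(1))
```

With a perfect noise predictor the loop lands on the target latent, which makes this a convenient check that the update rule is wired correctly before a learned model is plugged in.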
5. Experimental Results and Quantitative Performance
SMPLitex’s performance is evaluated across three benchmarks: Market-1501 (cross-view re-rendering of images), THuman2.0 (multi-view render-from-scan), and DeepFashion-MultiModal (qualitative assessment on images).
| Dataset | Metric | TexFormer | CMR | HPBTT | RSTG | TexGlo | SMPLitex |
|---|---|---|---|---|---|---|---|
| Market-1501 | SSIM | 0.7422 | 0.7142 | 0.7420 | 0.6735 | 0.6658 | 0.8648 (+0.12) |
| Market-1501 | LPIPS (↓) | 0.1154 | 0.1275 | 0.1168 | 0.1778 | 0.1776 | 0.0695 (–0.0459) |
| THuman2.0 | SSIM | 0.8761 | – | – | – | – | 0.8829 (+0.0068) |
| THuman2.0 | LPIPS (↓) | 0.1223 | – | – | – | – | 0.1067 (–0.0156) |
On Market-1501, SMPLitex improves SSIM by about 0.12 over the strongest baseline (TexFormer, 0.7422 to 0.8648) and lowers the perceptual error (LPIPS) by 0.0459 (0.1154 to 0.0695). On THuman2.0, the gains over TexFormer are smaller: SSIM rises by 0.0068 and LPIPS falls by 0.0156. DeepFashion-MultiModal evaluation highlights superior recovery of high-frequency details (e.g., garment wrinkles, facial anisotropy) not matched by GAN-based methods.
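For readers unfamiliar with SSIM, the sketch below computes a simplified, single-window version of the index over a whole image. This is an assumption-laden simplification: standard SSIM averages the same expression over local sliding windows, and the source does not describe its evaluation code, so values from this function are not comparable with the numbers reported above.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over whole images (simplified illustration).

    x, y: arrays of identical shape with values in [0, data_range].
    Returns 1.0 for identical inputs; lower values mean less similarity.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2      # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2      # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Higher SSIM is better, whereas LPIPS is a learned perceptual distance for which lower is better, which is why the two metrics move in opposite directions in the table.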
Qualitative examination reveals artifact-free novel view synthesis, plausible completions in occluded regions, and sharp attribute transfer across diverse poses and body shapes.
6. Applications and Capabilities
SMPLitex enables a range of editing and synthesis tasks:
- Partial-mask inpainting: Arbitrary UV regions of a texture map can be masked out and inpainted, supporting localized editing such as color changes or logo addition.
- Novel-view synthesis: Completed textures mapped onto the SMPL mesh are rendered from arbitrary viewpoints and remain consistent across pose sequences, because the SMPL UV parameterization is fixed.
- Text-driven attribute manipulation: The LDM backbone allows sampling conditioned on text prompts, supporting synthesis of textures with user-defined clothing, appearance, or accessories. Interpolation in CLIP-embedding or diffusion latent space facilitates smooth morphing between attribute sets.
- Latent-space editing: For prompt embeddings $e_1$, $e_2$ (of text prompts $p_1$, $p_2$), forming $e_\alpha = (1 - \alpha)\, e_1 + \alpha\, e_2$ with $\alpha \in [0, 1]$ enables synthesis of intermediate textures with interpolated attributes.
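The latent-space editing idea can be illustrated with a small sketch that blends two prompt embeddings. Linear interpolation is assumed here, the function name is hypothetical, and the random arrays merely stand in for text-encoder outputs; the (77, 768) shape matches the text encoder used by Stable Diffusion v1 models.

```python
import numpy as np

def interpolate_embeddings(e1, e2, alpha):
    """Linear blend of two prompt embeddings; alpha = 0 gives e1, alpha = 1 gives e2."""
    e1 = np.asarray(e1, dtype=np.float64)
    e2 = np.asarray(e2, dtype=np.float64)
    return (1.0 - alpha) * e1 + alpha * e2

# Toy usage: five evenly spaced blends between two made-up embeddings.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal((77, 768))   # stand-in for the embedding of prompt 1
emb_b = rng.standard_normal((77, 768))   # stand-in for the embedding of prompt 2
blends = [interpolate_embeddings(emb_a, emb_b, a) for a in np.linspace(0.0, 1.0, 5)]
```

Each blended embedding would then be used as the conditioning input for a separate sampling run, ideally with the same initial noise so that only the attributes change between results.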
7. Contributions and Release
SMPLitex’s principal contributions are:
- A fine-tuned latent diffusion backbone that natively generates high-fidelity, fully-differentiable SMPL UV textures.
- A one-shot single-image fitting pipeline requiring only DensePose correspondences and silhouette masking, allowing high-quality texture completion for previously unseen subjects.
- Public release of both 100 curated SMPL-mapped textures (standardized to UV format) and the trained diffusion model, thus providing reusable assets for downstream research in human texture synthesis, text-driven editing, and 3D avatar creation (Casas et al., 2023).