
PaintHuman-Style DSD for 3D Texturing

Updated 3 December 2025
  • The paper introduces Denoised Score Distillation (DSD), which refines SDS with negative guidance to overcome the over-smoothed textures typical of 3D human avatar texturing.
  • It integrates depth-map geometric guidance with classifier-free positive/negative predictions to yield sharper, semantically aligned textures.
  • Quantitative metrics and user studies confirm that PaintHuman-Style DSD significantly improves texture quality and material fidelity over conventional methods.

PaintHuman-Style Denoised Score Distillation (DSD) is a methodology for high-fidelity text-to-3D human texturing, specifically addressing the limitations of conventional Score Distillation Sampling (SDS) in producing semantically accurate and geometrically aligned textures on 3D human meshes. The approach, featured in the PaintHuman framework, introduces a denoised score distillation objective that integrates negative guidance and geometric depth conditioning into the generative process. This innovation yields improvements in texture quality, semantic consistency, and geometric detail, validated by both quantitative metrics and user studies (Yu et al., 2023).

1. Background: Standard Score Distillation Sampling (SDS)

Score Distillation Sampling (SDS) serves as a foundational mechanism for zero-shot text-to-3D generation, leveraging pre-trained text-to-image diffusion models such as Stable Diffusion. Given a mesh with 3D texture parameters $\theta$, an image $x$ is rendered as $x = g_\theta(\text{mesh})$ and encoded into a latent representation $z$. The SDS process employs text prompt embeddings $y$ (CLIP/text), Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, and a random timestep $t$ sampled uniformly. Noise predictions $\epsilon_\phi(z_t, y, t)$ are produced by a U-Net, optionally enhanced with classifier-free guidance: $$\hat\epsilon_\phi(z_t, y, t) = (1+\omega)\,\epsilon_\phi(z_t, y, t) - \omega\,\epsilon_\phi(z_t, \varnothing, t).$$ The diffusion loss per timestep is

$$\mathcal{L}_{\rm Diff}(z, y, t) = w(t)\,\|\epsilon_\phi(z_t, y, t) - \epsilon\|_2^2,$$

with the SDS gradient (excluding the U-Net Jacobian) given by

$$\nabla_\theta \mathcal{L}_{\rm SDS} = w(t)\bigl(\hat\epsilon_\phi(z_t, y, t) - \epsilon\bigr)\frac{\partial z_t}{\partial \theta}.$$
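For concreteness, the following PyTorch sketch implements this guided SDS update; `unet`, `text_emb`, `null_emb`, and `alphas_cumprod` are assumed handles to a pre-trained latent diffusion model, not names from the paper:

```python
import torch

def cfg_eps(unet, z_t, t, emb, null_emb, omega=7.5):
    """Classifier-free guidance: (1 + ω)·ε(cond) − ω·ε(uncond)."""
    return (1 + omega) * unet(z_t, t, emb) - omega * unet(z_t, t, null_emb)

def sds_grad(unet, z, text_emb, null_emb, alphas_cumprod, omega=7.5):
    """One SDS step: returns w(t)·(ε̂ − ε), the gradient w.r.t. the latent z
    (the U-Net Jacobian is deliberately excluded via no_grad)."""
    t = torch.randint(20, 980, (1,), device=z.device)   # uniform random timestep
    eps = torch.randn_like(z)                           # ε ~ N(0, I)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_t.sqrt() * z + (1 - a_t).sqrt() * eps       # forward diffusion q(z_t | z)
    with torch.no_grad():
        eps_hat = cfg_eps(unet, z_t, t, text_emb, null_emb, omega)
    w_t = 1 - a_t                                       # one common choice of w(t)
    return w_t * (eps_hat - eps)
```

Because the U-Net Jacobian is skipped, the returned tensor is applied via `z.backward(gradient=sds_grad(...))`, letting autograd carry the remaining factor $\partial z_t/\partial\theta$ through the encoder and differentiable renderer.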

However, direct application of SDS in human avatar texturing often produces over-smoothed textures and poor semantic alignment with mesh geometry, motivating the refinement proposed in PaintHuman.

2. Denoised Score Distillation (DSD) Formulation

Denoised Score Distillation augments the SDS objective by incorporating a negative guidance term at each training iteration. This counteracts the tendency of SDS gradients to over-smooth textures and enables iterative correction towards greater semantic and geometric fidelity. At step $i$:

  • $z_t^i$: current image latent at timestep $t$;
  • $\hat z_t^{\,i-1}$: the previous iteration's latent, used as a negative image;
  • $y$: positive text embedding;
  • $\hat y$: negative text embedding (such as "ugly, disfigured");
  • $\lambda > 0$: weighting factor for the negative term.

The DSD loss at each timestep is $$\mathcal{L}_{\rm DSD} = w(t)\,\|\epsilon_\phi(z_t^i, y, t) - \epsilon\|_2^2 - \lambda\,\|\epsilon_\phi(\hat z_t^{\,i-1}, \hat y, t) - \epsilon\|_2^2,$$ yielding the gradient (omitting the U-Net Jacobian) $$\nabla_\theta \mathcal{L}_{\rm DSD} = w(t)\Bigl(\hat\epsilon_\phi(z_t, y, t) - \lambda\,\hat\epsilon_\phi(\hat z_t, \hat y, t) - (1-\lambda)\,\epsilon\Bigr)\frac{\partial z_t}{\partial \theta}.$$ This gradient supports texture optimization that is simultaneously sharper and semantically aligned with the text, given both positive and negative guidance.
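Under the same assumptions (and reusing `cfg_eps` from the sketch in Section 1), a minimal DSD step differs from SDS only in the extra negative branch built from the previous iteration's latent:

```python
import torch

def dsd_grad(unet, z, z_prev, text_emb, neg_emb, null_emb,
             alphas_cumprod, lam=0.5, omega=7.5):
    """One DSD step: positive guided prediction minus a λ-weighted negative
    prediction built from the previous latent z_prev and a negative prompt
    embedding neg_emb. All handles are hypothetical stand-ins."""
    t = torch.randint(20, 980, (1,), device=z.device)
    eps = torch.randn_like(z)                            # shared noise ε
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_t.sqrt() * z + (1 - a_t).sqrt() * eps        # positive branch
    zn_t = a_t.sqrt() * z_prev + (1 - a_t).sqrt() * eps  # negative branch: same t, same ε
    with torch.no_grad():                                # U-Net Jacobian omitted
        eps_pos = cfg_eps(unet, z_t, t, text_emb, null_emb, omega)
        eps_neg = cfg_eps(unet, zn_t, t, neg_emb, null_emb, omega)
    w_t = 1 - a_t
    # w(t)·(ε̂_pos − λ·ε̂_neg − (1−λ)·ε), matching the DSD gradient above
    return w_t * (eps_pos - lam * eps_neg - (1 - lam) * eps)
```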

3. Training Procedure and Implementation

The workflow for PaintHuman-Style DSD is structured as a multi-iteration optimization over the SV-BRDF network parameters $\theta$, using both positive and negative text–image pairs. The key algorithmic steps can be summarized as:

  1. Render the current image and depth: $x^i, \text{depth} = \text{Render}(\text{mesh}, \text{SV-BRDF}(\theta), \text{camera})$.
  2. Encode to latent: $z^i = \text{Encoder}(x^i)$.
  3. Sample a timestep $t$ and noise $\epsilon$, and noise the latent to obtain $z^i_t$.
  4. Choose the negative pair: if $i > 1$, noise the previous latent $z^{i-1}$ with the same $t$ and $\epsilon$ and sample a negative prompt embedding; otherwise use a warm-up (no subtraction).
  5. Compute classifier-free guided positive and negative noise predictions.
  6. Form the DSD gradient and update $\theta$ via backpropagation.
  7. Carry over the latent and optionally adjust the camera view for semantic zoom.

The procedure is summarized in the following pseudocode:

Input:
 mesh (fixed topology), text prompt y, negative prompt pool ŷ, weight λ, total iters N,
 pre-trained Stable Diffusion U-Net ε_φ (depth-conditioned variant),
 differentiable renderer g_θ (θ = SV-BRDF network parameters)

Initialize:
 θ ← random SV-BRDF network weights (albedo k_d, roughness r, metallic m)
 z_prev ← None  // buffer carrying the previous iteration's latent (negative image)

for i in 1..N do
  // 1) Render current image & depth
  x^i, depth ← Render(mesh, SV-BRDF(θ), camera)
  // 2) Encode to latent
  z^i ← Encoder(x^i)
  // 3) Sample a shared timestep and noise, then noise the latent
  t ← Uniform(t_min, t_max);  ε ← N(0, I)
  z^i_t ← noise_latent(z^i, t, ε)
  // 4) Choose the negative pair (same t and ε as the positive branch)
  if i > 1 then
    ẑ_t ← noise_latent(z_prev, t, ε)
    ŷ_emb ← random choice from ŷ
  else
    ẑ_t ← z^i_t;  ŷ_emb ← blank  // warm-up: no effective subtraction
  end
  // 5) Classifier-free guided noise predictions (depth as extra conditioning)
  ε̂_pos ← CFG(ε_φ, z^i_t, y, t, depth)
  ε̂_neg ← CFG(ε_φ, ẑ_t, ŷ_emb, t, depth)
  // 6) Form the DSD gradient and update θ by backprop through Encoder ∘ Render
  grad ← w(t) · (ε̂_pos − λ·ε̂_neg − (1−λ)·ε)
  θ ← θ − lr · grad · (∂z^i_t / ∂θ)
  // 7) Carry over the latent; optionally zoom the camera every K steps
  z_prev ← z^i
  if i mod K == 0 then
    camera ← zoom_to_face(mesh)
  end
end for

Output: SV-BRDF network θ defining the final textures

4. Depth-Map Geometric Guidance

PaintHuman introduces geometric guidance via depth maps, leveraging the depth-to-image variant of Stable Diffusion rather than the standard text-only-conditioned U-Net. For each rendered image, a depth map $D = \text{DepthRender}(\text{mesh}, \text{camera})$ is produced and concatenated with the latent $z_t$, serving as an additional conditioning channel for the denoising network $\epsilon_\phi$. No explicit depth loss term is required; the model's learned depth-to-image prior ensures that local geometry (e.g., cloth wrinkles, accessories) is accurately transferred onto the mesh. This integration strengthens semantic and geometric alignment in the synthesis process.
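A minimal sketch of this conditioning pattern follows, assuming a U-Net whose input convolution accepts the widened $4{+}1$-channel latent (as in Stable Diffusion's depth-to-image variant); `unet` and the tensor layout are stand-ins, not the paper's code:

```python
import torch
import torch.nn.functional as F

def depth_conditioned_eps(unet, z_t, t, emb, depth):
    """Resize the rendered depth map to the latent resolution, normalize it,
    and concatenate it as an extra input channel for the denoising U-Net."""
    d = F.interpolate(depth, size=z_t.shape[-2:],
                      mode="bicubic", align_corners=False)
    d = 2.0 * (d - d.amin()) / (d.amax() - d.amin() + 1e-8) - 1.0  # scale to [-1, 1]
    return unet(torch.cat([z_t, d], dim=1), t, emb)                # (B, 4+1, h, w) input
```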

5. Geometry-Aware SV-BRDF Network

Surface appearance representation is realized via a spatially-varying BRDF (SV-BRDF) parameterization. Each query point $x_p$ in mesh space is input to a small coordinate-based MLP after positional encoding $\gamma(x_p)$, followed by a secondary MLP $\sigma(\cdot)$ mapping to surface material parameters:

  • $[k_d(x_p), r(x_p), m(x_p)]$ (diffuse albedo, roughness, metallic), with each network implemented using $\approx 32$ hidden units.

Rendering adheres to a simplified physically-based integral: $$R(x_p) = \int_{l \cdot n \geq 0} L_i(l)\,(f_d + f_s)(l, v)\,(l \cdot n)\, dl,$$ where

  • $f_d(x_p) = k_d(x_p)/\pi$ is the diffuse term,
  • $f_s(l, v)$ uses the Disney BRDF parameterization, and
  • $k_s = m \cdot k_d + (1 - m) \cdot 0.04$ is the specular color.

A differentiable split-sum approximation of the specular term is used to keep $\partial R/\partial \theta$ tractable. A small total-variation (smoothness) prior is applied to $k_d$ to encourage local coherence in the diffuse albedo.
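A minimal PyTorch sketch of such a network follows, under assumed details: a standard Fourier positional encoding for $\gamma$, sigmoid-bounded outputs, and illustrative layer counts (the text specifies only $\approx 32$ hidden units):

```python
import torch
import torch.nn as nn

class SVBRDF(nn.Module):
    """Coordinate-based SV-BRDF: positional encoding γ followed by a small
    MLP σ mapping mesh-space points to (k_d, r, m)."""
    def __init__(self, n_freqs=6, hidden=32):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 3 + 3 * 2 * n_freqs                 # xyz plus sin/cos encodings
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 5),                    # k_d (3) + r (1) + m (1)
        )

    def forward(self, x_p):                          # x_p: (N, 3) mesh-space points
        enc = [x_p] + [f(2.0 ** k * x_p) for k in range(self.n_freqs)
                       for f in (torch.sin, torch.cos)]
        out = torch.sigmoid(self.mlp(torch.cat(enc, dim=-1)))
        k_d, r, m = out[:, :3], out[:, 3:4], out[:, 4:5]
        k_s = m * k_d + (1 - m) * 0.04               # specular color, as in the text
        return k_d, r, m, k_s

def tv_prior(k_d_map):
    """Total-variation smoothness prior on a rasterized albedo map (B, 3, H, W)."""
    return ((k_d_map[..., 1:, :] - k_d_map[..., :-1, :]).abs().mean()
            + (k_d_map[..., :, 1:] - k_d_map[..., :, :-1]).abs().mean())
```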

6. Quantitative and Qualitative Evaluation

PaintHuman’s DSD-driven methodology demonstrates superior performance in both automated metrics and user preference studies. The following results summarize the comparative findings; Δ denotes PaintHuman’s relative improvement over each baseline.

Mean CLIP Score Over Six Views:

Method                      Score    Δ (PaintHuman improvement)
Latent-Paint                24.11    +19.99%
TEXTure                     25.34    +14.17%
Fantasia3D                  27.10    +6.75%
PaintHuman                  28.93    -
DreamHuman                  25.79    +12.25%
PaintHuman (same prompt)    28.95    -
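As a point of reference, a mean CLIP score of this kind can be reproduced with a standard recipe such as the sketch below; the paper does not specify its exact CLIP backbone or scaling, so `openai/clip-vit-base-patch32` and the ×100 cosine-similarity convention are assumptions:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def mean_clip_score(views, prompt, model_name="openai/clip-vit-base-patch32"):
    """Mean text-image CLIP similarity over rendered views (a list of PIL images)."""
    model = CLIPModel.from_pretrained(model_name).eval()
    proc = CLIPProcessor.from_pretrained(model_name)
    inputs = proc(text=[prompt], images=views, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)   # unit-normalize embeddings
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return 100.0 * (img @ txt.T).mean().item()   # averaged over the six views
```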

User Study (Score 1–5):

Method          Score          Δ (PaintHuman improvement)
Latent-Paint    1.21 ± 0.70    +148.8%
TEXTure         1.28 ± 0.60    +135.2%
Fantasia3D      1.76 ± 0.70    +71.0%
DreamHuman      2.83 ± 0.82    +6.4%
PaintHuman      3.01 ± 0.95    -

PaintHuman yields markedly sharper garment folds, discernible accessory detailing (belt, tie), and semantically correct color placement. SDS-only outputs typically exhibit over-smoothed textures; adding either depth conditioning or negative prompts alone mainly improves local fidelity without fully recovering semantic correctness. Only the full DSD formulation achieves both high texture fidelity and geometric adherence.

7. Limitations and Outlook

Current usage requires manual curation of negative prompts (e.g., “bad hands”, “ugly”), which might be mitigated by automated negative-prompt mining for improved DSD stability. The method roughly doubles the per-step U-Net inference cost, since both positive and negative guided predictions must be computed; research into more efficient approximations could lower runtime. Occasional texture seams persist in highly complex clothing (e.g., multilevel lace and prints), suggesting that additional geometric cues beyond depth (normals or curvature) might be beneficial. The mesh topology remains fixed; enabling local displacement or mesh refinement could improve representation of fine details such as wrinkles. Extension to dynamic, non-rigid garments via differentiable cloth simulation remains an active research avenue.

DSD refines vanilla SDS gradients by subtracting a negative image–text pair term, which, in tandem with depth-aware diffusion and SV-BRDF shading, produces high-fidelity 3D human avatar textures with improved semantic and geometric quality compared to previous approaches (Yu et al., 2023).

References

  • Yu et al. (2023). PaintHuman: Towards High-Fidelity Text-to-3D Human Texturing via Denoised Score Distillation.