
PaintHuman-Style DSD for 3D Texturing

Updated 3 December 2025
  • The paper introduces Denoised Score Distillation (DSD), which refines SDS with negative guidance to overcome the over-smoothed textures typical of 3D human avatar texturing.
  • It integrates depth-map geometric guidance with classifier-free positive/negative predictions to yield sharper, semantically aligned textures.
  • Quantitative metrics and user studies confirm that PaintHuman-Style DSD significantly improves texture quality and material fidelity over conventional methods.

PaintHuman-Style Denoised Score Distillation (DSD) is a methodology for high-fidelity text-to-3D human texturing, specifically addressing the limitations of conventional Score Distillation Sampling (SDS) in producing semantically accurate and geometrically aligned textures on 3D human meshes. The approach, featured in the PaintHuman framework, introduces a denoised score distillation objective that integrates negative guidance and geometric depth conditioning into the generative process. This innovation yields improvements in texture quality, semantic consistency, and geometric detail, validated by both quantitative metrics and user studies (Yu et al., 2023).

1. Background: Standard Score Distillation Sampling (SDS)

Score Distillation Sampling (SDS) serves as a foundational mechanism for zero-shot text-to-3D generation, leveraging pre-trained text-to-image diffusion models such as Stable Diffusion. Given a mesh with 3D texture parameters $\theta$, an image $x$ is rendered as $x = g_\theta(\text{mesh})$ and encoded into a latent representation $z$. The SDS process employs text prompt embeddings $y$ (CLIP/text), Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, and a random timestep $t$ sampled uniformly. Noise predictions $\epsilon_\phi(z_t, y, t)$ are produced by a U-Net, optionally enhanced with classifier-free guidance: $$\hat\epsilon_\phi(z_t, y, t) = (1+\omega)\,\epsilon_\phi(z_t, y, t) - \omega\,\epsilon_\phi(z_t, \varnothing, t).$$ The diffusion loss per timestep is

$$\mathcal{L}_{\rm Diff}(z, y, t) = w(t)\,\|\epsilon_\phi(z_t, y, t) - \epsilon\|_2^2,$$

with the SDS gradient (excluding the U-Net Jacobian) given by

$$\nabla_\theta \mathcal{L}_{\rm SDS} = w(t)\bigl(\hat\epsilon_\phi(z_t, y, t) - \epsilon\bigr)\frac{\partial z_t}{\partial \theta}.$$
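For concreteness, the following PyTorch sketch implements this guided SDS update; `unet`, `text_emb`, `null_emb`, and `alphas_cumprod` are assumed handles to a pre-trained latent diffusion model, not names from the paper:

```python
import torch

def cfg_eps(unet, z_t, t, emb, null_emb, omega=7.5):
    """Classifier-free guidance: (1 + ω)·ε(cond) − ω·ε(uncond)."""
    return (1 + omega) * unet(z_t, t, emb) - omega * unet(z_t, t, null_emb)

def sds_grad(unet, z, text_emb, null_emb, alphas_cumprod, omega=7.5):
    """One SDS step: returns w(t)·(ε̂ − ε), the gradient w.r.t. the latent z
    (the U-Net Jacobian is deliberately excluded via no_grad)."""
    t = torch.randint(20, 980, (1,), device=z.device)   # uniform random timestep
    eps = torch.randn_like(z)                           # ε ~ N(0, I)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_t.sqrt() * z + (1 - a_t).sqrt() * eps       # forward diffusion q(z_t | z)
    with torch.no_grad():
        eps_hat = cfg_eps(unet, z_t, t, text_emb, null_emb, omega)
    w_t = 1 - a_t                                       # one common choice of w(t)
    return w_t * (eps_hat - eps)
```

Because the U-Net Jacobian is skipped, the returned tensor is applied via `z.backward(gradient=sds_grad(...))`, letting autograd carry the remaining factor $\partial z_t/\partial\theta$ through the encoder and differentiable renderer.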

However, direct application of SDS in human avatar texturing often produces over-smoothed textures and poor semantic alignment with mesh geometry, motivating the refinement proposed in PaintHuman.

2. Denoised Score Distillation (DSD) Formulation

Denoised Score Distillation augments the SDS objective by incorporating a negative guidance term at each training iteration. This counteracts the tendency of SDS gradients to over-smooth textures and enables iterative correction towards greater semantic and geometric fidelity. At step $i$:

  • $z_t^i$: current image latent at timestep $t$;
  • $\hat z_t^{\,i-1}$: the previous iteration's latent, used as a negative image;
  • $y$: positive text embedding;
  • $\hat y$: negative text embedding (such as "ugly, disfigured");
  • $\lambda > 0$: weighting factor for the negative term.

The DSD loss at each timestep is $$\mathcal{L}_{\rm DSD} = w(t)\,\|\epsilon_\phi(z_t^i, y, t) - \epsilon\|_2^2 - \lambda\,\|\epsilon_\phi(\hat z_t^{\,i-1}, \hat y, t) - \epsilon\|_2^2,$$ yielding the gradient (omitting the U-Net Jacobian) $$\nabla_\theta \mathcal{L}_{\rm DSD} = w(t)\Bigl(\hat\epsilon_\phi(z_t, y, t) - \lambda\,\hat\epsilon_\phi(\hat z_t, \hat y, t) - (1-\lambda)\,\epsilon\Bigr)\frac{\partial z_t}{\partial \theta}.$$ This gradient supports texture optimization that is simultaneously sharper and semantically aligned with the text, given both positive and negative guidance.
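Under the same assumptions (and reusing `cfg_eps` from the sketch in Section 1), a minimal DSD step differs from SDS only in the extra negative branch built from the previous iteration's latent:

```python
import torch

def dsd_grad(unet, z, z_prev, text_emb, neg_emb, null_emb,
             alphas_cumprod, lam=0.5, omega=7.5):
    """One DSD step: positive guided prediction minus a λ-weighted negative
    prediction built from the previous latent z_prev and a negative prompt
    embedding neg_emb. All handles are hypothetical stand-ins."""
    t = torch.randint(20, 980, (1,), device=z.device)
    eps = torch.randn_like(z)                            # shared noise ε
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_t.sqrt() * z + (1 - a_t).sqrt() * eps        # positive branch
    zn_t = a_t.sqrt() * z_prev + (1 - a_t).sqrt() * eps  # negative branch: same t, same ε
    with torch.no_grad():                                # U-Net Jacobian omitted
        eps_pos = cfg_eps(unet, z_t, t, text_emb, null_emb, omega)
        eps_neg = cfg_eps(unet, zn_t, t, neg_emb, null_emb, omega)
    w_t = 1 - a_t
    # w(t)·(ε̂_pos − λ·ε̂_neg − (1−λ)·ε), matching the DSD gradient above
    return w_t * (eps_pos - lam * eps_neg - (1 - lam) * eps)
```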

3. Training Procedure and Implementation

The workflow for PaintHuman-Style DSD is structured as a multi-iteration optimization over the SV-BRDF network parameters $\theta$, using both positive and negative text–image pairs. The key algorithmic steps can be summarized as:

  1. Render the current image and depth: $x^i, \text{depth} = \text{Render}(\text{mesh}, \text{SV-BRDF}(\theta), \text{camera})$.
  2. Encode to latent: $z^i = \text{Encoder}(x^i)$.
  3. Sample a timestep $t$ and noise $\epsilon$, and noise the latent to obtain $z^i_t$.
  4. Choose the negative pair: if $i > 1$, noise the previous latent $z^{i-1}$ with the same $t$ and $\epsilon$ and sample a negative prompt embedding; otherwise use a warm-up (no subtraction).
  5. Compute classifier-free guided positive and negative noise predictions.
  6. Form the DSD gradient and update $\theta$ via backpropagation.
  7. Carry over the latent and optionally adjust the camera view for semantic zoom.

The procedure is summarized in the following pseudocode:

Input:
 mesh (fixed topology), text prompt y, negative prompt pool ŷ, weight λ, total iters N,
 pre-trained Stable Diffusion U-Net ε_φ (depth-conditioned variant),
 differentiable renderer g_θ (θ = SV-BRDF network parameters)

Initialize:
 θ ← random SV-BRDF network weights (albedo k_d, roughness r, metallic m)
 z_prev ← None  // buffer carrying the previous iteration's latent (negative image)

for i in 1..N do
  // 1) Render current image & depth
  x^i, depth ← Render(mesh, SV-BRDF(θ), camera)
  // 2) Encode to latent
  z^i ← Encoder(x^i)
  // 3) Sample a shared timestep and noise, then noise the latent
  t ← Uniform(t_min, t_max);  ε ← N(0, I)
  z^i_t ← noise_latent(z^i, t, ε)
  // 4) Choose the negative pair (same t and ε as the positive branch)
  if i > 1 then
    ẑ_t ← noise_latent(z_prev, t, ε)
    ŷ_emb ← random choice from ŷ
  else
    ẑ_t ← z^i_t;  ŷ_emb ← blank  // warm-up: no effective subtraction
  end
  // 5) Classifier-free guided noise predictions (depth as extra conditioning)
  ε̂_pos ← CFG(ε_φ, z^i_t, y, t, depth)
  ε̂_neg ← CFG(ε_φ, ẑ_t, ŷ_emb, t, depth)
  // 6) Form the DSD gradient and update θ by backprop through Encoder ∘ Render
  grad ← w(t) · (ε̂_pos − λ·ε̂_neg − (1−λ)·ε)
  θ ← θ − lr · grad · (∂z^i_t / ∂θ)
  // 7) Carry over the latent; optionally zoom the camera every K steps
  z_prev ← z^i
  if i mod K == 0 then
    camera ← zoom_to_face(mesh)
  end
end for

Output: SV-BRDF network θ defining the final textures

4. Depth-Map Geometric Guidance

PaintHuman introduces geometric guidance via depth maps, leveraging the depth-to-image variant of Stable Diffusion rather than the standard text-only-conditioned U-Net. For each rendered image, a depth map $D = \text{DepthRender}(\text{mesh}, \text{camera})$ is produced and concatenated with the latent $z_t$, serving as an additional conditioning channel for the denoising network $\epsilon_\phi$. No explicit depth loss term is required; the model's learned depth-to-image prior ensures that local geometry (e.g., cloth wrinkles, accessories) is accurately transferred onto the mesh. This integration strengthens semantic and geometric alignment in the synthesis process.
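A minimal sketch of this conditioning pattern follows, assuming a U-Net whose input convolution accepts the widened $4{+}1$-channel latent (as in Stable Diffusion's depth-to-image variant); `unet` and the tensor layout are stand-ins, not the paper's code:

```python
import torch
import torch.nn.functional as F

def depth_conditioned_eps(unet, z_t, t, emb, depth):
    """Resize the rendered depth map to the latent resolution, normalize it,
    and concatenate it as an extra input channel for the denoising U-Net."""
    d = F.interpolate(depth, size=z_t.shape[-2:],
                      mode="bicubic", align_corners=False)
    d = 2.0 * (d - d.amin()) / (d.amax() - d.amin() + 1e-8) - 1.0  # scale to [-1, 1]
    return unet(torch.cat([z_t, d], dim=1), t, emb)                # (B, 4+1, h, w) input
```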

5. Geometry-Aware SV-BRDF Network

Surface appearance representation is realized via a spatially-varying BRDF (SV-BRDF) parameterization. Each query point $x_p$ in mesh space is input to a small coordinate-based MLP after positional encoding $\gamma(x_p)$, followed by a secondary MLP $\sigma(\cdot)$ mapping to surface material parameters:

  • $[k_d(x_p), r(x_p), m(x_p)]$ (diffuse albedo, roughness, metallic), with each network implemented using $\approx 32$ hidden units.

Rendering adheres to a simplified physically-based integral: $$R(x_p) = \int_{l \cdot n \geq 0} L_i(l)\,(f_d + f_s)(l, v)\,(l \cdot n)\, dl,$$ where

  • $f_d(x_p) = k_d(x_p)/\pi$ is the diffuse term,
  • $f_s(l, v)$ uses the Disney BRDF parameterization, and
  • $k_s = m \cdot k_d + (1 - m) \cdot 0.04$ is the specular color.

A differentiable split-sum approximation of the specular term is used to keep $\partial R/\partial \theta$ tractable. A small total-variation (smoothness) prior is applied to $k_d$ to encourage local coherence in the diffuse albedo.
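A minimal PyTorch sketch of such a network follows, under assumed details: a standard Fourier positional encoding for $\gamma$, sigmoid-bounded outputs, and illustrative layer counts (the text specifies only $\approx 32$ hidden units):

```python
import torch
import torch.nn as nn

class SVBRDF(nn.Module):
    """Coordinate-based SV-BRDF: positional encoding γ followed by a small
    MLP σ mapping mesh-space points to (k_d, r, m)."""
    def __init__(self, n_freqs=6, hidden=32):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 3 + 3 * 2 * n_freqs                 # xyz plus sin/cos encodings
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 5),                    # k_d (3) + r (1) + m (1)
        )

    def forward(self, x_p):                          # x_p: (N, 3) mesh-space points
        enc = [x_p] + [f(2.0 ** k * x_p) for k in range(self.n_freqs)
                       for f in (torch.sin, torch.cos)]
        out = torch.sigmoid(self.mlp(torch.cat(enc, dim=-1)))
        k_d, r, m = out[:, :3], out[:, 3:4], out[:, 4:5]
        k_s = m * k_d + (1 - m) * 0.04               # specular color, as in the text
        return k_d, r, m, k_s

def tv_prior(k_d_map):
    """Total-variation smoothness prior on a rasterized albedo map (B, 3, H, W)."""
    return ((k_d_map[..., 1:, :] - k_d_map[..., :-1, :]).abs().mean()
            + (k_d_map[..., :, 1:] - k_d_map[..., :, :-1]).abs().mean())
```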

6. Quantitative and Qualitative Evaluation

PaintHuman’s DSD-driven methodology demonstrates superior performance in both automated metrics and user preference studies. The following results summarize the comparative findings; Δ denotes PaintHuman’s relative improvement over each baseline.

Mean CLIP Score Over Six Views:

Method                      Score    Δ (PaintHuman improvement)
Latent-Paint                24.11    +19.99%
TEXTure                     25.34    +14.17%
Fantasia3D                  27.10    +6.75%
PaintHuman                  28.93    -
DreamHuman                  25.79    +12.25%
PaintHuman (same prompt)    28.95    -
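As a point of reference, a mean CLIP score of this kind can be reproduced with a standard recipe such as the sketch below; the paper does not specify its exact CLIP backbone or scaling, so `openai/clip-vit-base-patch32` and the ×100 cosine-similarity convention are assumptions:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def mean_clip_score(views, prompt, model_name="openai/clip-vit-base-patch32"):
    """Mean text-image CLIP similarity over rendered views (a list of PIL images)."""
    model = CLIPModel.from_pretrained(model_name).eval()
    proc = CLIPProcessor.from_pretrained(model_name)
    inputs = proc(text=[prompt], images=views, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)   # unit-normalize embeddings
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return 100.0 * (img @ txt.T).mean().item()   # averaged over the six views
```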

User Study (Score 1–5):

Method          Score          Δ (PaintHuman improvement)
Latent-Paint    1.21 ± 0.70    +148.8%
TEXTure         1.28 ± 0.60    +135.2%
Fantasia3D      1.76 ± 0.70    +71.0%
DreamHuman      2.83 ± 0.82    +6.4%
PaintHuman      3.01 ± 0.95    -

PaintHuman yields markedly sharper garment folds, discernible accessory detailing (belt, tie), and semantically correct color placement. SDS-only outputs typically exhibit over-smoothed textures; adding either depth conditioning or negative prompts alone mainly improves local fidelity without fully recovering semantic correctness. Only the full DSD formulation achieves both high texture fidelity and geometric adherence.

7. Limitations and Outlook

Current usage requires manual curation of negative prompts (e.g., “bad hands”, “ugly”), which might be mitigated by automated negative-prompt mining for improved DSD stability. The method roughly doubles the per-step U-Net inference cost, since both positive and negative guided predictions must be computed; research into more efficient approximations could lower runtime. Occasional texture seams persist in highly complex clothing (e.g., multilevel lace and prints), suggesting that additional geometric cues beyond depth (normals or curvature) might be beneficial. The mesh topology remains fixed; enabling local displacement or mesh refinement could improve representation of fine details such as wrinkles. Extension to dynamic, non-rigid garments via differentiable cloth simulation remains an active research avenue.

DSD refines vanilla SDS gradients by subtracting a negative image–text pair term, which, in tandem with depth-aware diffusion and SV-BRDF shading, produces high-fidelity 3D human avatar textures with improved semantic and geometric quality compared to previous approaches (Yu et al., 2023).

References

  • Yu et al. (2023). PaintHuman: Towards High-Fidelity Text-to-3D Human Texturing via Denoised Score Distillation.