Paint-it: Generative PBR Texture Synthesis
- Paint-it is a generative system for creating full PBR texture maps from text prompts, integrating deep neural re-parameterization with diffusion-based guidance.
- It employs a U-Net architecture and score distillation sampling (SDS) to optimize textures, ensuring semantic control and suppression of noisy gradients.
- The system delivers relightable, artifact-free PBR maps for diverse applications like gaming, AR/VR, and film, outperforming prior methods in quality and user scores.
Paint-it is a generative system for text-driven, high-fidelity physically-based rendering (PBR) texture synthesis on 3D meshes, integrating deep convolutional neural re-parameterization with modern diffusion-based guidance. It produces a full set of PBR texture maps (diffuse, roughness, metalness, normal) directly from a text prompt via an optimization-based approach that combines Score Distillation Sampling (SDS) with a U-Net-based neural parameterization and differentiable physically-based rendering under image-based environment lighting. Paint-it addresses key practical and methodological bottlenecks in mesh texturing: it enables semantic-level user control, suppresses optimization artifacts induced by noisy gradient signals from diffusion models, and supports direct material-level manipulation in downstream engines.
1. System Architecture and Texture Parameterization
Paint-it replaces pixel-level parameterization of mesh UV textures with a deep convolutional neural network (DCNN), specifically a randomly initialized U-Net with skip connections, denoted $\phi_\theta$. Instead of directly updating each texel, the texture maps are the outputs of this DCNN given a fixed spatial noise field:

$$(K_d, K_{rm}, K_n) = \phi_\theta(z)$$

where:
- $K_d$: 3-channel diffuse texture
- $K_{rm}$: 2-channel roughness and metalness
- $K_n$: 3-channel normal map
- $z$: fixed 2D noise input
This neural re-parameterization supports structured, frequency-ordered optimization (spectral bias), naturally filters out high-frequency noise from gradient updates, and aligns with the hierarchical patterns of real-world material properties. The U-Net prior regularizes the texture maps, providing smooth spatial correlations and robustness to noisy or underspecified text prompts.
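The re-parameterization interface can be sketched as follows. This is a minimal NumPy toy that stands in for the full U-Net with a single 3×3 convolution, showing only how a fixed noise field maps to an 8-channel stack that is split into the three PBR maps; the function names and toy architecture are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def texture_from_network(weights, z):
    """Toy stand-in for the U-Net: a single 3x3 convolution mapping a
    fixed noise field z of shape (H, W, C_in) to an 8-channel texture
    stack. The real system uses a full randomly initialized U-Net."""
    H, W, C_in = z.shape
    C_out = 8  # 3 diffuse + 2 roughness/metalness + 3 normal
    pad = np.pad(z, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros((H, W, C_out))
    for i in range(H):
        for j in range(W):
            patch = pad[i:i + 3, j:j + 3, :].ravel()  # 3x3 neighborhood
            out[i, j] = weights @ patch               # weights: (8, 9 * C_in)
    return out

def split_pbr_maps(stack):
    """Split the 8-channel network output into the three PBR maps."""
    k_d = stack[..., 0:3]   # diffuse (RGB)
    k_rm = stack[..., 3:5]  # roughness, metalness
    k_n = stack[..., 5:8]   # tangent-space normal
    return k_d, k_rm, k_n

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 16, 4))             # fixed noise input, never updated
weights = rng.standard_normal((8, 9 * 4)) * 0.1  # the optimized parameters
k_d, k_rm, k_n = split_pbr_maps(texture_from_network(weights, z))
```

The point of the indirection is that optimization updates `weights`, never the texels themselves, so every texel value is coupled through the shared convolutional parameters.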
2. Optimization Objective via Score Distillation Sampling (SDS)
Core supervision for text alignment and visual realism comes from Score Distillation Sampling (SDS), as introduced in DreamFusion. At each optimization step:
- The mesh is rendered (physically based rendering, multi-view, environment lighting) with the current textures, producing images $x$.
- A pre-trained diffusion model (frozen weights) receives a noised version $x_t$ of the rendering, along with the text prompt $y$ and a sampled timestep $t$.
- The SDS update is formulated as

$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_\psi(x_t; y, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \,\right]$$

where $\hat{\epsilon}_\psi$ predicts the denoising residual for $x_t$ given prompt $y$, $w(t)$ is a timestep-dependent weight, and $\epsilon$ is the sampled noise.
Optimization thus encourages the rendered view distribution to score highly under the text-conditioned diffusion model, distilling the generative prior into the texture synthesis.
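The per-step update can be sketched as a one-sample Monte Carlo estimate of the SDS gradient. The explicit Jacobian here is an illustrative simplification; in practice the Jacobian-vector product is computed by automatic differentiation through the differentiable renderer:

```python
import numpy as np

def sds_grad(eps_pred, eps, w_t, dx_dtheta):
    """One-sample estimate of w(t) * (eps_hat(x_t; y, t) - eps) * dx/dtheta.
    eps_pred, eps: (D,) noise prediction and sampled noise (image space).
    dx_dtheta: (D, P) Jacobian of the rendered image w.r.t. texture params."""
    residual = w_t * (eps_pred - eps)  # image-space "score" residual, (D,)
    return residual @ dx_dtheta        # chain rule back to theta, (P,)

rng = np.random.default_rng(1)
D, P = 12, 5
eps = rng.standard_normal(D)                    # sampled noise
eps_pred = eps + 0.1 * rng.standard_normal(D)   # frozen diffusion model's prediction
dx_dtheta = rng.standard_normal((D, P))
g = sds_grad(eps_pred, eps, w_t=1.0, dx_dtheta=dx_dtheta)
```

Note that when the model's prediction exactly matches the injected noise, the residual vanishes and no update flows to the texture parameters; optimization pressure comes only from the mismatch between the rendering and the text-conditioned prior.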
3. DC-PBR: Effect of Deep Convolutional PBR Parameterization
The DC-PBR method contrasts with pixel-wise UV parameterizations and per-point MLP-based schemes. Its advantages include:
- Frequency Curriculum: U-Nets learn low frequencies before high ones, suppressing the immediate adoption of high-frequency noise from the SDS gradients and yielding smoother, more realistic early-stage texture synthesis.
- Noise Filtering: The spatial inductive bias of convolutions prevents accumulation of incoherent, high-frequency artifacts, especially prevalent in SDS-driven optimization.
- Material Expressiveness: Full PBR maps (diffuse, roughness, metalness, normal) support high-fidelity relighting, material variation, and view-dependent effects in standard rendering pipelines.
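The noise-filtering claim can be illustrated directly: local averaging of the kind a convolution layer implements is a low-pass filter, and it sharply reduces the variance of white noise (a stand-in for a noisy SDS gradient image). The box filter below is an illustrative assumption, not the U-Net's actual learned kernels:

```python
import numpy as np

def box_filter(img, k=5):
    """Separable k x k box filter: the kind of local averaging a
    convolution layer can implement, acting as a low-pass filter."""
    kernel = np.ones(k) / k
    # filter rows, then columns
    tmp = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, tmp)

rng = np.random.default_rng(2)
noise = rng.standard_normal((64, 64))  # stand-in for a noisy SDS gradient image
smoothed = box_filter(noise)
# high-frequency components are strongly attenuated:
print(noise.var(), smoothed.var())
```

A pixel-wise parameterization passes such noise straight into the texture; a convolutional one can only realize it through spatially correlated updates, which is the inductive bias the section describes.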
Ablation studies show that replacing the DC-PBR architecture with pixel-wise or MLP-based parameterization results in excessive noise, patchiness, or limited expressiveness, and in worse FID and user scores relative to DC-PBR.
4. Rendering Pipeline: Physically-Based and Differentiable
Paint-it uses a standard physically-based rendering (PBR) model with a Cook-Torrance BRDF, rendering the outgoing radiance at each surface point $p$ as

$$L_o(p, \omega_o) = \int_\Omega f_r(p, \omega_i, \omega_o)\, L_i(p, \omega_i)\, (\omega_i \cdot n)\, d\omega_i$$

where each surface point is assigned spatially varying BRDF properties via the neural-generated textures, and the shading normal $n$ is taken from the generated normal map. This process is differentiable, so SDS gradients backpropagate through both rendering and texture generation.
Integration with differentiable rasterization frameworks (e.g., NVDiffRast) facilitates end-to-end optimization compatible with standard graphics hardware (15–30 min per mesh at full resolution on RTX A6000).
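For concreteness, a single-light Cook-Torrance evaluation with common term choices (GGX normal distribution, Smith-Schlick geometry, Schlick Fresnel) might look like the sketch below; the helper names and specific approximations are assumptions for illustration, not the paper's exact renderer:

```python
import numpy as np

def ggx_specular(n_dot_l, n_dot_v, n_dot_h, v_dot_h, roughness, f0):
    """Cook-Torrance specular term D*G*F / (4 (n.l)(n.v)) with GGX
    distribution, Smith-Schlick geometry, and Schlick Fresnel."""
    a2 = roughness ** 4  # alpha = roughness^2 (Disney convention)
    d = a2 / (np.pi * (n_dot_h ** 2 * (a2 - 1.0) + 1.0) ** 2)      # GGX NDF
    k = (roughness + 1.0) ** 2 / 8.0
    g = (n_dot_l / (n_dot_l * (1 - k) + k)) * \
        (n_dot_v / (n_dot_v * (1 - k) + k))                        # Smith-Schlick
    f = f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5                     # Schlick Fresnel
    return d * g * f / (4.0 * n_dot_l * n_dot_v + 1e-7)

def shade(k_d, roughness, metalness, n_dot_l, n_dot_v, n_dot_h, v_dot_h, light):
    """Single-light Cook-Torrance shading from per-texel PBR values."""
    f0 = 0.04 * (1 - metalness) + k_d * metalness  # dielectric vs. metal base reflectance
    spec = ggx_specular(n_dot_l, n_dot_v, n_dot_h, v_dot_h, roughness, f0)
    diffuse = (1 - metalness) * k_d / np.pi        # Lambertian diffuse lobe
    return (diffuse + spec) * light * n_dot_l

color = shade(k_d=np.array([0.8, 0.2, 0.1]), roughness=0.4, metalness=0.0,
              n_dot_l=0.7, n_dot_v=0.9, n_dot_h=0.95, v_dot_h=0.8,
              light=np.array([1.0, 1.0, 1.0]))
```

Because every operation here is a smooth function of the texture values (diffuse, roughness, metalness), gradients from the rendered color flow back to the maps, which is what makes the SDS supervision of Section 2 applicable.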
5. Experimental Evaluation and Quantitative Results
Comprehensive validation across datasets spanning objects, humans, and animals (Objaverse, RenderPeople, SMAL) demonstrates:
| Method | PBR-Maps | FID (↓) | User Score (↑, /5) |
|---|---|---|---|
| Latent-Paint | No | 57.35 | 2.14 |
| Fantasia3D | No | 51.01 | 2.52 |
| TEXTure | No | 37.28 | 3.21 |
| Paint-it (DC-PBR) | Yes | 34.46 | 4.37 |
Paint-it achieves the best FID and is the only method with a user score above the "realistic" threshold (4.0). User studies confirm high material and semantic quality, and ablation studies confirm the necessity of both the DC-PBR parameterization and full multi-channel PBR supervision.
6. Practical Applications and Generalization
Paint-it directly yields multi-channel, relightable, and physically correct PBR texture maps compatible with industry-standard engines (Blender, Unreal, Unity), supporting:
- Text-prompt-based 3D asset creation from scratch for arbitrary meshes.
- Relighting and material editing at test time (diffuse/roughness/metalness/normal).
- Support for animated and dynamic meshes (as UVs are preserved).
- View-consistent, artifact-suppressed detail as required in AR, VR, gaming, and film production contexts.
Generalization is demonstrated by successful application to objects, clothed humans, and animals with comparable fidelity.
7. Methodological Impact and Future Directions
Paint-it establishes the importance of neural re-parameterization in optimization-based texture synthesis. By embedding the DC-PBR architecture within the SDS-optimized pipeline, it overcomes the optimization instability typical of diffusion-model supervision, especially in high-dimensional texture spaces.
A plausible implication is the possibility of feedforward, supervised training for real-time applications, as Paint-it's fundamental design decouples texture parameterization from per-pixel or per-point artifacts. The approach also sets the groundwork for future multi-modal and interactive 3D asset creation pipelines, potentially integrating direct user guidance or extensions to other material representations.