Hunyuan3D-Paint: Diffusion-Based Texture Synthesis for 3D Assets

Updated 25 June 2025

Hunyuan3D-Paint is a large-scale, diffusion-based texture synthesis system designed to generate high-fidelity, spatially aligned, and physically plausible texture maps for 3D assets. As a foundational component of the Hunyuan3D series of frameworks for 3D AI-generated content (AIGC), Hunyuan3D-Paint transforms untextured 3D meshes into photorealistic models with production-quality, physically based rendering (PBR) materials, including albedo, metallic, and roughness maps. The system is characterized by its multi-view, geometry-conditioned diffusion architecture, advanced attention mechanisms, and integration with robust 3D generative pipelines.

1. Architectural Foundations and Integration

Hunyuan3D-Paint operates as the primary texture/material generator within the two-stage Hunyuan3D pipeline, which separates geometry and texture synthesis for both scalability and asset quality. In typical deployment:

  • Stage 1: Shape Generation is handled by Hunyuan3D-DiT (or LATTICE in the 2.5 version), which outputs a detailed, watertight mesh from input images.
  • Stage 2: Texture Synthesis is performed by Hunyuan3D-Paint, which takes as input the mesh geometry, reference images (delighted, i.e. with baked-in illumination removed, and canonicalized), normal and canonical coordinate maps, and viewpoint information.

The texture model uses a dual-branch UNet with a multi-channel variational autoencoder (VAE) backbone and incorporates geometry features and view direction as explicit conditioning. Conditioning inputs include mesh-derived normal and canonical coordinate maps, and reference image features processed through a frozen encoder.

Integration with shape generation is strictly modular: mesh outputs from Hunyuan3D-DiT or LATTICE are consumed by Hunyuan3D-Paint for texturing, supporting workflows where either or both stages can be swapped for user-supplied data.
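
As a rough sketch of this modular hand-off, the snippet below wires a Stage 1 mesh (from Hunyuan3D-DiT, LATTICE, or a user-supplied asset) into a Stage 2 texturing call. All class, method, and field names (TextureConditioning, render_normals, render_ccm, generate_pbr_maps, and so on) are illustrative assumptions, not the released API.

```python
# Illustrative sketch of the two-stage pipeline; names are assumptions, not the released API.
from dataclasses import dataclass

import torch


@dataclass
class TextureConditioning:
    reference_image: torch.Tensor  # delighted, canonicalized reference image, (3, H, W)
    normal_maps: torch.Tensor      # per-view mesh normal maps, (V, 3, H, W)
    ccm_maps: torch.Tensor         # per-view canonical coordinate maps, (V, 3, H, W)
    camera_poses: torch.Tensor     # per-view camera extrinsics, (V, 4, 4)


def texture_asset(reference_image, shape_model, paint_model, mesh=None):
    """Run Stage 1 (geometry) only if no mesh is supplied, then Stage 2 (texturing)."""
    if mesh is None:
        mesh = shape_model.generate(reference_image)          # Stage 1: Hunyuan3D-DiT / LATTICE
    cond = TextureConditioning(
        reference_image=reference_image,
        normal_maps=mesh.render_normals(paint_model.views),   # geometry conditioning
        ccm_maps=mesh.render_ccm(paint_model.views),
        camera_poses=paint_model.views,
    )
    return paint_model.generate_pbr_maps(mesh, cond)          # Stage 2: albedo / metallic / roughness
```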

2. Multi-View PBR Diffusion and Attention Mechanisms

The core innovation of Hunyuan3D-Paint lies in its multi-view, PBR-focused diffusion architecture, developed further in versions 2.1 and 2.5. The system generates albedo, metallic, and roughness maps consistently across multiple viewpoints, enabling highly detailed, relightable assets suitable for modern rendering engines.

Multi-View Diffusion Architecture

  • Viewpoint Conditioning: Optimized algorithms select 8–12 camera views that maximize UV map coverage based on mesh geometry, allowing efficient and comprehensive texture generation.
  • Dual-Branch UNet: Separate but coordinated branches generate albedo and MR (metallic-roughness) channels, using spatially shared attention masks to maintain inter-channel and cross-view alignment.
  • Multi-Attention Mechanisms: Parallel self-attention, reference attention, and multi-view attention branches are implemented. The following multi-task attention formula governs feature integration:

Z_{MVA} = Z_{SA} + \lambda_{ref} \cdot \mathrm{Softmax}\left(\frac{Q_{ref} K_{ref}^{T}}{\sqrt{d}}\right) V_{ref} + \lambda_{mv} \cdot \mathrm{Softmax}\left(\frac{Q_{mv} K_{mv}^{T}}{\sqrt{d}}\right) V_{mv}

where Z_{SA} is the self-attention output, (Q, K, V) denote the query, key, and value tensors of the reference-attention and multi-view-attention branches, and \lambda_{ref}, \lambda_{mv} are learned weights.

  • Material Channel Alignment: Spatial alignment is achieved by sharing the attention mask computed in the albedo branch with the MR and normal branches, so that the material channels remain spatially registered even though their semantic content differs per channel. A minimal code sketch of the multi-view attention combination follows this list.
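
The sketch below is a minimal PyTorch rendition of the attention formula above, under the assumption that the reference features and the other views' latent tokens arrive as separate key/value sequences and that \lambda_{ref} and \lambda_{mv} are learned scalars; the projection layers and shapes are illustrative rather than the released implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn


class MultiViewAttention(nn.Module):
    """Sketch of Z_MVA = Z_SA + lambda_ref * Attn(ref) + lambda_mv * Attn(mv)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads, self.dim = heads, dim
        self.to_qkv_self = nn.Linear(dim, 3 * dim)      # self-attention projections
        self.to_q_ref = nn.Linear(dim, dim)             # reference-attention projections
        self.to_kv_ref = nn.Linear(dim, 2 * dim)
        self.to_q_mv = nn.Linear(dim, dim)              # multi-view-attention projections
        self.to_kv_mv = nn.Linear(dim, 2 * dim)
        self.lambda_ref = nn.Parameter(torch.zeros(1))  # learned blending weights
        self.lambda_mv = nn.Parameter(torch.zeros(1))

    def _attn(self, q, k, v):
        b, n, _ = q.shape
        h, d = self.heads, self.dim // self.heads
        q, k, v = (t.view(b, -1, h, d).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)   # Softmax(Q K^T / sqrt(d)) V
        return out.transpose(1, 2).reshape(b, n, self.dim)

    def forward(self, x, ref_tokens, mv_tokens):
        # Self-attention over the current view's latent tokens (Z_SA).
        z_sa = self._attn(*self.to_qkv_self(x).chunk(3, dim=-1))
        # Reference attention: keys/values come from the reference image features.
        z_ref = self._attn(self.to_q_ref(x), *self.to_kv_ref(ref_tokens).chunk(2, dim=-1))
        # Multi-view attention: keys/values come from the other views' tokens.
        z_mv = self._attn(self.to_q_mv(x), *self.to_kv_mv(mv_tokens).chunk(2, dim=-1))
        return z_sa + self.lambda_ref * z_ref + self.lambda_mv * z_mv
```

In the dual-branch setting described above, the attention weights computed for the albedo branch would additionally be shared with the MR branch to keep the material channels spatially aligned; the sketch omits that sharing for brevity.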

3D-Aware Rotary Positional Encoding (RoPE)

A 3D-aware RoPE mechanism encodes view and spatial position information within the latent representations, promoting cross-view consistency and minimizing seams and ghosting in generated textures. This mechanism uses multi-level mesh coordinate downsampling, allowing positional encoding to match UNet hierarchy depth.
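
As a rough illustration of how rotary embeddings can be made 3D-aware, the sketch below splits the channel dimension across the x, y, and z axes and rotates query/key channel pairs with frequencies driven by per-token canonical mesh coordinates; this is a generic construction under stated assumptions, not the exact Hunyuan3D-Paint encoding.

```python
import torch


def rope_3d(q: torch.Tensor, k: torch.Tensor, coords: torch.Tensor, base: float = 10000.0):
    """Apply a 3D-aware rotary positional encoding to q and k.

    q, k:    (batch, tokens, dim) attention inputs (per-head channels), dim divisible by 6
    coords:  (batch, tokens, 3) per-token canonical mesh coordinates,
             e.g. downsampled to match the current UNet level.
    """
    _, _, dim = q.shape
    assert dim % 6 == 0, "channel dim must split evenly across 3 axes and channel pairs"
    d_axis = dim // 3                      # channels allotted to each of x, y, z
    half = d_axis // 2
    freqs = base ** (-torch.arange(half, device=q.device, dtype=q.dtype) / half)

    def rotate(t):
        out = []
        for axis in range(3):
            chunk = t[..., axis * d_axis:(axis + 1) * d_axis]
            angles = coords[..., axis:axis + 1] * freqs        # (batch, tokens, half)
            cos, sin = angles.cos(), angles.sin()
            x1, x2 = chunk[..., :half], chunk[..., half:]
            # Standard rotary rotation of each (x1, x2) channel pair.
            out.append(torch.cat([x1 * cos - x2 * sin,
                                  x1 * sin + x2 * cos], dim=-1))
        return torch.cat(out, dim=-1)

    return rotate(q), rotate(k)
```

Downsampling the mesh coordinates to each UNet level's feature resolution would supply the coords argument for the corresponding attention layers.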

3. Illumination-Invariant and Physically-Based Texture Synthesis

To ensure generation of textures that are robust to lighting variations, Hunyuan3D-Paint employs an illumination-invariant training strategy. This involves presenting the network with reference images of the same mesh rendered under different lighting conditions and minimizing a consistency loss on predicted albedo:

\mathcal{L}_{\text{consistency}} = \|A_1 - A_2\|_{1,2}

where A_1 and A_2 are albedo maps predicted from renders of the same mesh under two different lighting conditions. This approach disentangles intrinsic material color from input-dependent illumination.
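
A minimal PyTorch rendition of this term, under the assumption that the two albedo maps are predicted by the same network from differently lit renders of one mesh and that a simple L1 penalty stands in for the norm above:

```python
import torch
import torch.nn.functional as F


def illumination_consistency_loss(albedo_lit_a: torch.Tensor,
                                  albedo_lit_b: torch.Tensor) -> torch.Tensor:
    """L_consistency: penalize albedo differences that can only come from the
    input lighting, since the underlying mesh and material are identical.

    albedo_lit_a / albedo_lit_b: (batch, 3, H, W) albedo maps predicted from
    renders of the same mesh under two different lighting conditions.
    """
    return F.l1_loss(albedo_lit_a, albedo_lit_b)
```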

PBR material maps (albedo, metallic, roughness) are synthesized according to the Disney Principled BRDF model, ensuring direct compatibility with industry-standard game and content creation engines.
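
For context on how such maps feed a metallic-roughness BRDF, the small sketch below shows the conventional conversion from albedo/metallic/roughness to diffuse color, specular reflectance at normal incidence (F0), and the GGX alpha parameter; this is standard PBR practice rather than a Hunyuan3D-specific step.

```python
import torch

DIELECTRIC_F0 = 0.04  # common default specular reflectance for non-metals


def pbr_parameters(albedo: torch.Tensor, metallic: torch.Tensor, roughness: torch.Tensor):
    """Convert metallic-roughness maps into shading-ready quantities.

    albedo:    (..., 3) base color in [0, 1]
    metallic:  (..., 1) 0 = dielectric, 1 = metal
    roughness: (..., 1) perceptual roughness in [0, 1]
    """
    # Metals have no diffuse term; dielectrics keep their base color.
    diffuse = albedo * (1.0 - metallic)
    # Specular color: constant 0.04 for dielectrics, tinted by albedo for metals.
    f0 = DIELECTRIC_F0 * (1.0 - metallic) + albedo * metallic
    # Many microfacet BRDFs (e.g. GGX) use the squared perceptual roughness as alpha.
    alpha = roughness ** 2
    return diffuse, f0, alpha
```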

4. Training Procedure and Resolution Strategy

Hunyuan3D-Paint initializes the diffusion backbone from established checkpoints (e.g., Zero-SNR Stable Diffusion 2.1) and trains on a large-scale dataset of multi-view mesh-texture pairs:

  • Resolution Strategy: Training is performed in two phases to balance computational efficiency and texture detail:
    1. Standard multi-view diffusion at 512 × 512 across 6 views establishes spatial consistency.
    2. "Zoom-in" training with randomly cropped 768 × 768 windows enhances local detail without prohibitive memory usage.

Super-resolution modules refine output views before texture baking.
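
A sketch of the "zoom-in" cropping described above, assuming the full-resolution renders and their conditioning maps for one mesh are cropped with one shared window so that geometry conditioning stays pixel-aligned with the texture targets (function and argument names are illustrative):

```python
import random

import torch


def zoom_in_crop(images: torch.Tensor, normals: torch.Tensor, ccms: torch.Tensor,
                 crop: int = 768):
    """Randomly crop aligned windows for the high-resolution training phase.

    images / normals / ccms: (views, C, H, W) full-resolution renders and
    conditioning maps for one mesh; the same window is applied to all of them
    so the conditioning maps stay pixel-aligned with the texture targets.
    """
    _, _, h, w = images.shape
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    window = (slice(None), slice(None), slice(top, top + crop), slice(left, left + crop))
    return images[window], normals[window], ccms[window]
```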

5. Texture Baking, UV Unwrapping, and Postprocessing

Synthesized images from the multi-view diffusion model are projected back onto the mesh UV map. Uncovered UV texels are identified using geometry-aware coverage functions and filled using interpolation or view-based inpainting. The process is as follows:

  • Render per-view conditioning maps and synthesize the corresponding multi-view texture images.
  • Unwrap the mesh UVs and project the synthesized images onto the surface.
  • Inpaint uncovered texels and postprocess to ensure seamlessness.

At inference, "dense view" generation minimizes inpainting requirements and supports user-driven, viewpoint-specific renders.
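
As an illustration of the projection-and-baking step, the sketch below blends per-view colors into UV texels with visibility-based weights and flags uncovered texels for inpainting; the weighting scheme and names are assumptions rather than the released baker.

```python
import torch


def bake_uv_texture(per_view_colors: torch.Tensor,
                    per_view_weights: torch.Tensor,
                    eps: float = 1e-6):
    """Blend multi-view colors into the UV texture.

    per_view_colors:  (views, H_uv, W_uv, 3) colors sampled into UV space by
                      projecting each synthesized view onto the mesh.
    per_view_weights: (views, H_uv, W_uv, 1) visibility weights, e.g.
                      max(cos(view_dir, normal), 0), zero for occluded texels.

    Returns the blended texture and a mask of texels no view covered,
    which are subsequently filled by inpainting.
    """
    weight_sum = per_view_weights.sum(dim=0)                              # (H, W, 1)
    blended = (per_view_colors * per_view_weights).sum(dim=0) / (weight_sum + eps)
    uncovered = (weight_sum < eps).squeeze(-1)                            # (H, W) bool mask
    return blended, uncovered
```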

6. Evaluation Metrics and Comparative Performance

Hunyuan3D-Paint achieves state-of-the-art performance in both automated metrics and user studies, showing measurable improvements over previous baseline and commercial systems:

Metric       Hunyuan3D-Paint (2.0)   Hunyuan3D 2.5 (latest)   Best Previous SOTA
CLIP-FID ↓   26.44                   23.97                    26.86–35.75
FID ↓        26.44 (CLIP-FID)        165.8                    176.9–189.2
CMMD ↓       2.318                   2.064                    2.400–3.047
CLIP-I ↑     0.8893                  0.9281                   0.8499–0.8871
LPIPS ↓      0.0059                  0.1231                   0.0076–0.1538

User studies across hundreds of asset generations report higher perceived quality, better prompt/image alignment, and greater diversity. The system supports both automatically generated and artist-supplied meshes, which broadens its usability.

7. Applications and Significance

Hunyuan3D-Paint supports critical use cases in contemporary digital content creation:

  • Gaming and Animation: Generation of ready-to-use, high-resolution, PBR-compliant assets with physically correct materials for rapid pipeline integration.
  • Virtual and Augmented Reality: Consistent material properties under dynamic relighting enable immersive, realistic 3D experiences.
  • Industrial and Product Design: Efficient, iterative design visualization from 2D sketches or reference photos, with photorealistic material behaviors.
  • Open-Source and Democratized Creation: Coupled with Hunyuan3D-Studio, the system is designed for both professional and amateur workflows, supporting image-to-texture and "bring your own mesh" capabilities under full open-source release.

A plausible implication is that Hunyuan3D-Paint and its iterative multi-view, geometry-conditioned, and physically-based diffusion architecture establish a new reference point for automated 3D asset texturing—enabling scalable, high-fidelity, and relightable 3D AIGC suited to the demands of modern interactive media and design pipelines.