Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation (2501.12202v3)

Published 21 Jan 2025 in cs.CV

Abstract: We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including the open-source models and closed-source models in geometry details, condition alignment, texture quality, and etc. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2

Summary

  • The paper introduces a two-stage diffusion pipeline that decouples 3D shape generation (via Hunyuan3D-DiT) from texture synthesis (through Hunyuan3D-Paint) for single-image input.
  • It employs advanced methods including latent diffusion, importance sampling, and multi-view consistency to capture fine-grained details and produce seamless textures.
  • Evaluations demonstrate state-of-the-art reconstruction and texture quality, with integration into Hunyuan3D-Studio enabling sketch-to-3D conversion and low-poly stylization.

Hunyuan3D 2.0 is an advanced, large-scale system designed for generating high-resolution textured 3D assets from a single image input. The system is presented as an open-source initiative aimed at providing foundational generative models for the 3D community. It employs a two-stage pipeline: first generating a bare 3D mesh, and then synthesizing a high-quality texture map for it. This decoupled approach offers flexibility, allowing users to texture both generated and existing meshes.

The core of Hunyuan3D 2.0 consists of two large-scale foundation models: Hunyuan3D-DiT for shape generation and Hunyuan3D-Paint for texture synthesis.
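
The sketch below illustrates this decoupling with stand-in callables; these are not the actual Hunyuan3D-2 API, and the point is only that the texturing stage accepts either a freshly generated mesh or a user-supplied one.

```python
# Minimal sketch of the decoupled two-stage pipeline: shape first, texture second.
# `shape_model` and `texture_model` are hypothetical stand-ins for
# Hunyuan3D-DiT and Hunyuan3D-Paint, not the real project API.

def generate_textured_asset(image, shape_model, texture_model, existing_mesh=None):
    # Stage 1: image -> bare mesh (skipped when the user supplies a mesh).
    mesh = existing_mesh if existing_mesh is not None else shape_model(image)
    # Stage 2: mesh + image -> high-resolution texture map.
    texture = texture_model(mesh, image)
    return mesh, texture

# Toy usage with placeholder callables.
mesh, tex = generate_textured_asset(
    image="chair.png",
    shape_model=lambda img: f"mesh_from({img})",
    texture_model=lambda m, img: f"texture_for({m}, {img})",
)
print(mesh, tex)
```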

Generative 3D Shape Generation (Hunyuan3D-DiT)

The shape generation component follows a latent diffusion model architecture. It comprises:

  1. Hunyuan3D-ShapeVAE: A variational autoencoder that compresses a 3D polygon mesh into a sequence of continuous latent tokens.
    • Representation: It uses vector sets as a compact neural representation of 3D shapes.
    • Encoder Input: Takes 3D coordinates and normal vectors of point clouds sampled from the mesh surface.
    • Key Innovation: Incorporates an Importance Sampled Point-Query Encoder. In addition to uniform surface sampling, it uses an importance sampling method to collect more points in high-frequency detail areas such as edges and corners. These are combined with the uniformly sampled points to form the input point cloud. Farthest Point Sampling (FPS) is applied separately to the uniform and importance samples to generate the corresponding point queries. Cross-attention and self-attention layers process these into a hidden shape representation, which is then projected to predict the mean and variance of the latent token sequence. This importance sampling helps capture fine-grained details; a minimal sketch of the sampling scheme follows this list.
    • Decoder: Reconstructs a 3D neural field (specifically, Signed Distance Function - SDF) from the latent tokens, which can then be converted into a triangle mesh using the marching cubes algorithm.
    • Training: Supervised with reconstruction loss (MSE on sampled spatial points and surface points) and a KL-divergence loss for regularization. A multi-resolution strategy varies the latent token sequence length during training to balance computation and quality.
  2. Hunyuan3D-DiT: A flow-based diffusion model trained on the latent space of ShapeVAE to generate latent shape token sequences from an image prompt.
    • Network Structure: Uses a transformer architecture inspired by FLUX, featuring both dual-stream (latent and condition tokens interact via attention) and single-stream (concatenated tokens processed together) blocks. Positional embedding for latent tokens is omitted as their sequence position doesn't map to a fixed 3D location; the token content encodes the shape information.
    • Condition Injection: Uses a large pre-trained image encoder (DINOv2 Giant) on $518 \times 518$ input images. Image preprocessing (background removal, centering, white background) is crucial to improve effective resolution and remove noise.
    • Training: Uses the flow matching objective, training the model to predict the velocity field of an affine path between a Gaussian distribution ($x_0$) and the data distribution ($x_1$).
    • Inference: Uses a first-order Euler ODE solver to generate $x_1$, starting from a randomly sampled $x_0 \sim \mathcal{N}(0, 1)$ and iteratively applying the learned velocity field; a sketch of this sampler also follows this list.
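
A minimal sketch of the importance-sampled point-cloud construction and FPS-derived point queries described in item 1 above, assuming uniform surface samples and extra samples concentrated on high-frequency regions are already available; the FPS routine is a plain reference implementation and the query counts are illustrative, not the paper's settings.

```python
import numpy as np

def farthest_point_sampling(points, num_samples):
    """Greedy FPS: repeatedly pick the point farthest from the already-selected set."""
    selected = [0]
    dist = np.full(len(points), np.inf)
    for _ in range(num_samples - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[-1]], axis=1))
        selected.append(int(dist.argmax()))
    return points[selected]

def build_point_queries(uniform_pts, importance_pts, n_uniform_q=512, n_importance_q=512):
    """Combine uniform and importance-sampled surface points into one input point
    cloud, and derive point queries by running FPS on each set separately."""
    point_cloud = np.concatenate([uniform_pts, importance_pts], axis=0)
    queries = np.concatenate([
        farthest_point_sampling(uniform_pts, n_uniform_q),
        farthest_point_sampling(importance_pts, n_importance_q),
    ], axis=0)
    return point_cloud, queries

# Toy usage: uniform surface samples plus extra samples near "edges"
# (a stand-in for high-frequency regions such as edges and corners).
uniform = np.random.rand(8192, 3)
edges = np.random.rand(2048, 3) * np.array([1.0, 1.0, 0.01])
cloud, queries = build_point_queries(uniform, edges)
print(cloud.shape, queries.shape)  # (10240, 3) (1024, 3)
```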

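A minimal sketch of the first-order Euler sampler used for flow-matching inference in item 2 above; `ToyVelocity` is a stand-in for the trained Hunyuan3D-DiT velocity field so the snippet runs end to end, and the 50-step schedule is an illustrative assumption.

```python
import torch

@torch.no_grad()
def euler_flow_sampler(velocity_model, cond, shape, num_steps=50):
    """Integrate dx/dt = v(x, t, cond) from t = 0 (Gaussian noise x_0)
    to t = 1 (data sample x_1) with a first-order Euler solver."""
    x = torch.randn(shape)                       # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * velocity_model(x, t, cond)  # one Euler step along the flow
    return x                                     # approximate x_1

class ToyVelocity(torch.nn.Module):
    """Stand-in velocity field over flattened latent tokens (ignores the condition)."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 256), torch.nn.SiLU(), torch.nn.Linear(256, dim))
    def forward(self, x, t, cond):
        return self.net(torch.cat([x, t[:, None]], dim=-1))

latents = euler_flow_sampler(ToyVelocity(64), cond=None, shape=(4, 64))
print(latents.shape)  # (4, 64) latent tokens, to be decoded by Hunyuan3D-ShapeVAE
```
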
Evaluation shows Hunyuan3D-ShapeVAE achieves state-of-the-art reconstruction performance, particularly in capturing fine details and producing clean meshes. Hunyuan3D-DiT generates bare meshes that demonstrate superior alignment with input image conditions compared to other models.

Generative Texture Map Synthesis (Hunyuan3D-Paint)

Hunyuan3D-Paint is designed to produce high-resolution, seamless, and light-invariant texture maps for a given mesh and image. Its framework includes three stages:

  1. Pre-processing:
    • Image Delighting Module: An image-to-image model trained to convert input images with complex lighting and shadows into an unlit state, preventing illumination from being baked into the texture. Trained on pairs of rendered 3D assets under HDRI and white light.
    • View Selection Strategy: A geometry-aware greedy algorithm (Algorithm 1 in the paper) that selects a minimal set of viewpoints (8 to 12) to maximize surface coverage for multi-view generation, reducing the need for extensive post-inpainting; a sketch of the greedy coverage idea follows this list.
  2. Multi-view Image Synthesis (Hunyuan3D-Paint): A geometry-conditioned multi-view diffusion model based on Stable Diffusion 2.1.
    • Architecture: Extends a base image diffusion model with mechanisms for image conditioning, multi-view consistency, and geometry conditioning.
    • Double-stream Image Conditioning Reference-Net: A reference branch that takes noiseless VAE features of the delighted (lighting-removed) reference image. Unlike previous methods, it uses frozen weights from the original SD2.1 model to act as a soft regularization, preventing the model from drifting towards the rendered dataset's style and improving performance on real images. Features from this branch are integrated via a reference attention module.
    • Multi-task Attention Mechanism: Introduces two additional parallel attention modules alongside the original self-attention: reference attention for image following and multi-view attention for consistency across generated views. The combined output is $Z_{MVA} = Z_{SA} + \lambda_{ref} \cdot \text{Softmax}\left(\frac{Q_{ref}K_{ref}^T}{\sqrt{d}}\right) V_{ref} + \lambda_{mv} \cdot \text{Softmax}\left(\frac{Q_{mv}K_{mv}^T}{\sqrt{d}}\right) V_{mv}$; a PyTorch sketch of this combination follows this list.
    • Geometry and View Conditioning: Geometry conditions (canonical normal maps and canonical coordinate maps) are encoded via a VAE and concatenated with latent noise as input. Learnable camera embeddings are also used to provide viewpoint information.
    • Training: Inherits weights from SD2.1 and trains on a large self-collected 3D dataset rendered under white light. Uses techniques like ZSNR scheduler and random azimuth/elevation for reference image rendering to increase robustness. A view dropout strategy is employed, randomly selecting 6 out of 44 pre-set viewpoints per batch, to enhance 3D perception and generalization for dense-view inference.
  3. Texture Baking:
    • Dense-view Inference: Allows inference from a dense set of viewpoints (not limited to the selected 8-12) to maximize surface coverage.
    • Single Image Super-resolution: Applies ESRGAN to each generated multi-view image to enhance texture quality before baking.
    • Texture Inpainting: For remaining small uncovered areas in the UV map, an intuitive approach is used: projecting the existing UV texture to vertices and then querying uncovered UV texels by a weighted sum of textures from connected, textured vertices based on geometric distance.
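
A minimal sketch of the geometry-aware greedy view selection from stage 1, assuming a precomputed boolean visibility matrix over candidate viewpoints and mesh faces; per-face area weighting is omitted and the thresholds are illustrative, so this captures the greedy coverage idea rather than Algorithm 1 verbatim.

```python
import numpy as np

def select_views(covers, max_views=12, target_coverage=0.99):
    """Greedily pick the candidate viewpoint that newly covers the most faces.
    covers[v, f] is True if face f is visible from candidate view v."""
    num_faces = covers.shape[1]
    covered = np.zeros(num_faces, dtype=bool)
    chosen = []
    while len(chosen) < max_views and covered.mean() < target_coverage:
        gains = (covers & ~covered).sum(axis=1)  # newly covered faces per candidate
        best = int(gains.argmax())
        if gains[best] == 0:                     # no candidate adds coverage
            break
        chosen.append(best)
        covered |= covers[best]
    return chosen, covered.mean()

# Toy usage: 40 candidate viewpoints, 5000 faces, random visibility.
rng = np.random.default_rng(0)
covers = rng.random((40, 5000)) < 0.15
views, coverage = select_views(covers)
print(views, f"{coverage:.2%}")
```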

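A PyTorch sketch of the multi-task attention combination from stage 2; the head count, token widths, and default weights $\lambda_{ref}$ and $\lambda_{mv}$ are illustrative assumptions, and in the actual model these branches live inside the SD2.1 UNet attention layers rather than as standalone modules.

```python
import torch

class MultiTaskAttention(torch.nn.Module):
    """Combine the original self-attention output with a reference-attention
    branch (image following) and a multi-view attention branch (cross-view
    consistency), weighted by lambda_ref and lambda_mv."""
    def __init__(self, dim, num_heads=8, lambda_ref=1.0, lambda_mv=1.0):
        super().__init__()
        self.self_attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ref_attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mv_attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lambda_ref, self.lambda_mv = lambda_ref, lambda_mv

    def forward(self, z, ref_tokens, mv_tokens):
        z_sa, _ = self.self_attn(z, z, z)                     # Z_SA
        z_ref, _ = self.ref_attn(z, ref_tokens, ref_tokens)   # reference attention
        z_mv, _ = self.mv_attn(z, mv_tokens, mv_tokens)       # multi-view attention
        return z_sa + self.lambda_ref * z_ref + self.lambda_mv * z_mv  # Z_MVA

# Toy usage: batch of 2 views, 256 latent tokens of width 320.
attn = MultiTaskAttention(dim=320)
z = torch.randn(2, 256, 320)
out = attn(z, ref_tokens=torch.randn(2, 77, 320), mv_tokens=torch.randn(2, 512, 320))
print(out.shape)  # (2, 256, 320)
```
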
Hunyuan3D-Paint can generate textures guided by text or image prompts, even for hand-crafted meshes, by using T2I models like ControlNet or IP-Adapter to generate an intermediate image aligned with the geometry. Evaluations demonstrate superior performance in texture map synthesis quality and semantic following compared to other methods. The generated texture maps are seamless and lighting-invariant, supporting applications like 're-skinning' (applying different textures to the same mesh).

Textured 3D Assets Generation (End-to-End)

Evaluating the full pipeline (Hunyuan3D-DiT + Hunyuan3D-Paint) shows that Hunyuan3D 2.0 outperforms baselines in generating high-quality, condition-following textured 3D assets, as measured by various image metrics on renderings and confirmed by a user study.

Hunyuan3D-Studio

The paper introduces Hunyuan3D-Studio, a platform integrating Hunyuan3D 2.0 models with additional tools to create a user-friendly 3D production pipeline. Key features highlighted include:

  • Sketch-to-3D: Converts 2D sketches into detailed images (likely via T2I models conditioned by sketches) and then uses the core generation system to produce textured 3D assets.
  • Low-polygon Stylization: Reduces the face count of dense generated meshes for computational efficiency. Uses traditional geometric editing (vertex merging via quadric error metrics) together with texture preservation (querying textures from the original dense mesh using a KD-tree and transferring them to the low-poly mesh via vertex color baking); a sketch of the texture transfer step follows this list.
  • 3D Character Animation: Enables animating generated characters. Extracts mesh features, uses Graph Neural Networks (GNNs) to detect skeleton key points and assign skinning weights, and applies motion retargeting based on templates.
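
A minimal sketch of the texture-preserving transfer in low-poly stylization, assuming the dense mesh already carries baked per-vertex colors and has been simplified elsewhere (e.g. by a quadric-error-metric decimator); the k-nearest inverse-distance blending is an illustrative choice rather than the paper's exact query.

```python
import numpy as np
from scipy.spatial import cKDTree

def transfer_vertex_colors(dense_vertices, dense_colors, lowpoly_vertices, k=4):
    """Transfer baked per-vertex colors from the dense source mesh to the
    simplified mesh: query the k nearest dense vertices with a KD-tree and
    blend their colors with inverse-distance weights."""
    tree = cKDTree(dense_vertices)
    dists, idx = tree.query(lowpoly_vertices, k=k)
    weights = 1.0 / np.maximum(dists, 1e-8)           # inverse-distance weights
    weights /= weights.sum(axis=1, keepdims=True)
    return (dense_colors[idx] * weights[..., None]).sum(axis=1)

# Toy usage: 50k dense vertices with RGB colors, 2k low-poly vertices.
dense_v = np.random.rand(50_000, 3)
dense_c = np.random.rand(50_000, 3)
low_v = np.random.rand(2_000, 3)
print(transfer_vertex_colors(dense_v, dense_c, low_v).shape)  # (2000, 3)
```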

Hunyuan3D 2.0 represents a significant step in large-scale 3D generative modeling, providing high-quality, condition-aligned, textured 3D assets and an integrated platform for accessibility and further research. The release of the code and pre-trained weights aims to foster development in the open-source 3D community.
