Hunyuan3D 2.1: Open-Source 3D Asset Generation System
Hunyuan3D 2.1 is an open-source system for high-fidelity, production-ready 3D asset generation from images, encompassing advanced shape generation and physically-based rendering (PBR) texture synthesis. Building on its predecessors, Hunyuan3D 2.1 introduces a modular, two-stage diffusion architecture designed for flexibility, scalability, and rigorous production standards in gaming, virtual/augmented reality (VR/AR), and industrial design.
1. System Architecture and Components
Hunyuan3D 2.1 employs a dual-module structure:
- Hunyuan3D-DiT (Shape Generation): A latent diffusion transformer model conditioned on image features, built atop a mesh autoencoder (ShapeVAE). It maps a single RGB input image to high-fidelity, watertight mesh geometry.
- Mesh Autoencoder (ShapeVAE):
Combines uniform and curvature-aware (importance) sampling for point clouds, encoding surface details as latent tokens. Variable token lengths ensure scalability for objects of varying complexity.
- Flow-based Diffusion:
Trained to predict a velocity field that transports Gaussian noise to valid shape-latent sequences. Given a clean latent $x_0$, noise $\epsilon \sim \mathcal{N}(0, I)$, and a timestep $t \in [0, 1]$, the model receives the interpolant $x_t = (1 - t)\,x_0 + t\,\epsilon$ and regresses the velocity $v_t = \epsilon - x_0$, giving the loss $\mathcal{L} = \mathbb{E}_{t,\,x_0,\,\epsilon} \left\lVert u_\theta(x_t, t, c) - (\epsilon - x_0) \right\rVert^2$, where $c$ denotes the conditioning image features (a minimal training-step sketch follows this list).
- Hunyuan3D-Paint (Texture Synthesis): A mesh-conditioned, multi-view diffusion model that generates albedo, metallic, and roughness PBR maps, enforcing both illumination invariance and multi-view consistency.
- Spatial-Aligned Multi-Attention:
Shares attention across the material channels (albedo, metallic, roughness) so that the predicted maps remain spatially aligned and mutually consistent.
- 3D-Aware Rotary Positional Encoding (RoPE):
Encodes viewpoint and spatial structure so that generated textures remain consistent, and therefore seam-free, across views (a sketch follows this list).
- Illumination-Invariant Training:
Ensures the system learns intrinsic material properties instead of lighting artifacts.
The architecture supports independent or sequential use of the shape and texture modules, facilitating workflow flexibility.
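For concreteness, the flow-matching objective above can be written as a single training step. Below is a minimal PyTorch sketch; the `model(x_t, t, cond)` interface is a generic assumption for illustration, not the actual Hunyuan3D-DiT signature.

```python
import torch

def flow_matching_step(model, x0, cond):
    """One flow-matching training step (sketch).

    x0:   clean shape latents, shape (B, L, D)
    cond: image-condition embedding -- hypothetical interface
    """
    B = x0.shape[0]
    t = torch.rand(B, device=x0.device)          # timestep ~ U(0, 1)
    eps = torch.randn_like(x0)                   # Gaussian noise endpoint
    t_ = t.view(B, 1, 1)
    xt = (1.0 - t_) * x0 + t_ * eps              # linear interpolant x_t
    v_target = eps - x0                          # constant velocity field
    v_pred = model(xt, t, cond)                  # predict velocity u_theta
    return torch.mean((v_pred - v_target) ** 2)  # MSE flow-matching loss
```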
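The 3D-aware RoPE component can be sketched similarly. The precise formulation used in Hunyuan3D-Paint is not detailed above; a common way to extend rotary encoding to 3D is to partition the channel dimension across the three spatial axes, which is the assumption made here.

```python
import torch

def rope_3d(tokens: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """Rotary positional encoding driven by 3D coordinates (sketch).

    tokens: (B, N, D) attention inputs, D divisible by 6.
    coords: (B, N, 3) one 3D position per token (e.g., from the view ray).
    Channels are split into three groups, one rotated per spatial axis.
    """
    B, N, D = tokens.shape
    d = D // 3                                   # channels per axis (even)
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2, device=tokens.device) / d))
    out = []
    for axis in range(3):
        x = tokens[..., axis * d:(axis + 1) * d]
        angles = coords[..., axis:axis + 1] * freqs       # (B, N, d/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]               # channel pairs
        out.append(torch.stack([x1 * cos - x2 * sin,
                                x1 * sin + x2 * cos], dim=-1).flatten(-2))
    return torch.cat(out, dim=-1)
```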
2. Data Preparation and Preprocessing Pipeline
Data curation and preprocessing in Hunyuan3D 2.1 are meticulous to ensure data fidelity, variety, and modularity:
- Shape Datasets:
Over 100,000 meshes and associated multi-view images are collected from ShapeNet, ModelNet40, Thingi10K, and Objaverse for robust latent shape modeling.
- Texture Datasets:
70,000 textured assets, primarily filtered from Objaverse-XL, provide high-quality and diverse material references.
- Shape Preprocessing:
- Normalization: Meshes are scaled and centered within the unit cube.
- Watertight Conversion: Conversion to closed surfaces via signed distance fields (SDF) and marching cubes.
- Hybrid Surface Sampling: Points are selected with both uniform and importance (curvature-aware) sampling (see the sketch after this list).
- Conditional Image Rendering: Up to 150 multi-view images are rendered per mesh.
- Texture Preprocessing:
- Meshes are rendered from diverse viewpoints and lighting, and multi-channel maps (albedo, metallic, roughness) at 512×512 resolution are generated.
- Lighting variations augment robustness for illumination-invariant training.
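The watertight-conversion and hybrid-sampling steps can be sketched as follows, using `trimesh` and `scikit-image`. The grid resolution and the curvature proxy are illustrative choices, not the project's documented preprocessing parameters.

```python
import numpy as np
import trimesh
from skimage import measure

def make_watertight(mesh: trimesh.Trimesh, res: int = 128) -> trimesh.Trimesh:
    """Closed-surface conversion: sample an SDF grid, then run marching cubes.

    A coarse grid is used for illustration; a production pipeline would use a
    faster SDF evaluator and a higher resolution.
    """
    grid = np.linspace(-1.0, 1.0, res)
    pts = np.stack(np.meshgrid(grid, grid, grid, indexing="ij"), axis=-1)
    # trimesh returns positive distance inside the mesh; negate to get the
    # inside-negative SDF convention.
    sdf = -trimesh.proximity.signed_distance(mesh, pts.reshape(-1, 3))
    verts, faces, _, _ = measure.marching_cubes(sdf.reshape(res, res, res),
                                                level=0.0)
    verts = verts / (res - 1) * 2.0 - 1.0       # grid indices -> [-1, 1]^3
    return trimesh.Trimesh(vertices=verts, faces=faces)

def hybrid_sample(mesh: trimesh.Trimesh, n: int = 8192) -> np.ndarray:
    """Half uniform surface samples, half biased toward high-curvature faces."""
    uniform, _ = trimesh.sample.sample_surface(mesh, n // 2)
    # Crude curvature proxy: faces whose vertex normals disagree with the
    # face normal lie in curved regions and receive more samples.
    vert_normals = mesh.vertex_normals[mesh.faces]            # (F, 3, 3)
    agreement = np.einsum("fij,fj->fi", vert_normals,
                          mesh.face_normals).mean(axis=1)
    weights = 1.0 - np.clip(agreement, -1.0, 1.0) + 1e-6
    important, _ = trimesh.sample.sample_surface(mesh, n - n // 2,
                                                 face_weight=weights)
    return np.concatenate([uniform, important], axis=0)
```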
3. Training Strategies and Model Optimization
Training of Hunyuan3D 2.1 leverages a suite of techniques:
- Shape VAE and Diffusion:
- Loss Functions:
- SDF MSE reconstruction with KL-divergence regularization: $\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_x \lVert \hat{s}(x) - s(x) \rVert^2 + \gamma\, D_{\mathrm{KL}}\big(q(z \mid X) \,\Vert\, \mathcal{N}(0, I)\big)$, where $s$ and $\hat{s}$ are the ground-truth and reconstructed SDF values at query points $x$, $z$ is the latent token sequence for shape $X$, and $\gamma$ weights the regularizer.
- Multi-resolution Training:
Latent token sequence length is varied up to 3072, enabling adaptation to complex geometry.
- Conditional Augmentation:
Condition images undergo background removal, centering, and resizing to emphasize relevant content (a sketch follows this list).
- Texture Diffusion (Hunyuan3D-Paint):
- Pretrained from a zero-SNR Stable Diffusion 2.1 base, optimized with AdamW and a 2000-step learning-rate warmup.
- Training consumed approximately 180 GPU-days, enabled by robust, large-scale data throughput.
- Illumination Consistency Loss:
- A cross-lighting consistency loss encourages the network to predict the same albedo and material properties for a view rendered under different lighting conditions (a sketch follows this list).
- Multi-view and 3D-aware Encoding:
- Each training sample consists of randomly selected multi-view renders to enforce view consistency.
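The conditional-image augmentation can be illustrated with a short preprocessing sketch. It uses the `rembg` library for background removal; the 90%-of-canvas object scale and white background are illustrative choices rather than documented Hunyuan3D parameters.

```python
import numpy as np
from PIL import Image
from rembg import remove  # off-the-shelf background removal

def prepare_condition_image(path: str, size: int = 512) -> Image.Image:
    """Remove background, center the object, and resize onto a white canvas."""
    rgba = remove(Image.open(path)).convert("RGBA")
    alpha = np.array(rgba)[:, :, 3]
    ys, xs = np.nonzero(alpha > 0)               # foreground pixel coords
    crop = rgba.crop((xs.min(), ys.min(), xs.max() + 1, ys.max() + 1))
    # Scale the object to ~90% of the canvas and paste it centered.
    scale = 0.9 * size / max(crop.size)
    crop = crop.resize((int(crop.width * scale), int(crop.height * scale)))
    canvas = Image.new("RGBA", (size, size), (255, 255, 255, 255))
    canvas.paste(crop, ((size - crop.width) // 2, (size - crop.height) // 2),
                 mask=crop)
    return canvas.convert("RGB")
```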
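The illumination-consistency idea reduces to a simple constraint: the same view rendered under two different lighting setups should yield the same predicted albedo. A minimal sketch, with `predict_albedo` standing in for the model's material branch (a hypothetical interface):

```python
import torch

def illumination_consistency_loss(predict_albedo, img_light_a, img_light_b):
    """Penalize albedo predictions that change when only the lighting changes.

    img_light_a / img_light_b: the same viewpoint rendered under two lighting
    setups, shape (B, 3, H, W). `predict_albedo` is a stand-in for the
    model's albedo branch (hypothetical interface).
    """
    albedo_a = predict_albedo(img_light_a)
    albedo_b = predict_albedo(img_light_b)
    # If the model has learned intrinsic material properties, the two
    # predictions agree regardless of illumination.
    return torch.mean((albedo_a - albedo_b) ** 2)
```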
4. Evaluation Metrics and Benchmarking
Hunyuan3D 2.1 employs comprehensive quantitative and qualitative metrics:
- Geometry Metrics:
- ULIP-I/T: similarity between the generated point cloud and the input image (I) or text prompt (T), computed in ULIP's unified language-image-point-cloud embedding space.
- Uni3D-I/T: the analogous image/text similarity scores, computed in the Uni3D unified 3D embedding space.
- Qualitative visual comparison for mesh fidelity and faithfulness to input images.
- Texture Metrics:
- CLIP-FID: Fréchet distance computed in CLIP feature space between generated and real asset textures; lower is better.
- CMMD: CLIP maximum mean discrepancy, a distributional distance between generated and reference textures; lower is better.
- CLIP-I: CLIP-based image similarity.
- LPIPS: Measures patch-level perceptual similarity.
- Comprehensive Evaluation:
Together, these metrics assess geometric faithfulness, photorealism, physical correctness, view consistency, and production readiness; a sketch of the CLIP-I computation follows.
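As one concrete example, CLIP-I can be computed as the cosine similarity between CLIP image embeddings of a rendered view and the input image. A minimal sketch using the Hugging Face `transformers` CLIP implementation (the checkpoint name is an illustrative choice):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_i(render_path: str, reference_path: str) -> float:
    """Cosine similarity between CLIP embeddings of render and reference."""
    images = [Image.open(render_path), Image.open(reference_path)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
    return float(feats[0] @ feats[1])
```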
5. Workflow and Deployment
The Hunyuan3D 2.1 workflow is modular and suited for both batch and interactive deployment:
- End-to-End Pipeline:
- Input an RGB image.
- Shape diffusion transformer (Hunyuan3D-DiT) produces latent tokens, which the VAE decodes into a 3D mesh.
- Hunyuan3D-Paint generates PBR texture maps (albedo, metallic, roughness) in a multi-view, lighting-agnostic manner.
- Outputs are ready to use in content-creation pipelines for game engines, AR/VR systems, and product-design tools.
- Flexible Use Cases:
The modular structure supports mesh-only, texture-only, or full asset generation, fitting diverse digital content requirements.
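In code, the end-to-end flow has roughly the following shape. All imports, class names, and methods below are hypothetical placeholders used to illustrate the two-stage call pattern; the repository linked in Section 7 documents the actual entry points.

```python
# Illustrative end-to-end driver; every class/method name here is a
# hypothetical placeholder, not the actual Hunyuan3D-2.1 API.
from hunyuan3d import ShapePipeline, PaintPipeline  # hypothetical imports

def image_to_asset(image_path: str, out_prefix: str) -> None:
    # Stage 1: image -> latent tokens -> decoded watertight mesh.
    shape_pipe = ShapePipeline.from_pretrained("tencent/Hunyuan3D-2.1")
    mesh = shape_pipe(image=image_path)

    # Stage 2: mesh + image -> multi-view-consistent PBR maps.
    paint_pipe = PaintPipeline.from_pretrained("tencent/Hunyuan3D-2.1")
    textured = paint_pipe(mesh=mesh, image=image_path)

    # Export a game-engine-ready asset (mesh + albedo/metallic/roughness).
    textured.export(f"{out_prefix}.glb")
```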
6. Applications and Production Suitability
Hunyuan3D 2.1 addresses multiple industrial domains:
- Gaming:
Large-scale generation of unique, realistic assets with PBR-compliant materials, suitable for direct use in engines such as Unity and Unreal.
- VR/AR:
Production of lightweight meshes with seamless, view-consistent materials that avoid immersion-breaking artifacts.
- Industrial Design:
Enables rapid 3D prototyping from concept images or product photos, with support for digital-twin workflows and custom material editing.
The approach enables both asset prototyping and in-game/runtime customization.
7. Open-source Impact and Future Directions
Hunyuan3D 2.1 is made fully open-source, encompassing models, data, and training code. Its modularity and performance aim to democratize 3D AI-generated content, fostering research and development across both academic and industrial settings. By bridging advanced academic methodologies with real-world asset requirements, Hunyuan3D 2.1 lowers the barrier for AI-powered 3D asset generation and supports a wide range of new applications in digital content creation.
The official codebase and pre-trained models are available at https://github.com/Tencent-Hunyuan/Hunyuan3D-2.1.