- The paper introduces a two-stage diffusion-based pipeline that enhances both detailed shape generation and multi-channel PBR texture synthesis.
- It leverages a 10B-parameter model with classifier-free guidance, achieving sharp geometric precision and smooth surface continuity even for complex objects.
- Quantitative and qualitative results, including a 72% win rate in user studies, demonstrate its superior performance over state-of-the-art 3D generation methods.
Hunyuan3D 2.5: High-Fidelity 3D Asset Generation via Scalable Diffusion and PBR Texturing
Hunyuan3D 2.5 introduces a comprehensive two-stage diffusion-based pipeline for texture-rich, high-detail 3D asset generation from one or several input images. The system advances both geometry generation and PBR-based material synthesis, with architectural and computational improvements that directly address the main constraints of previous work: the trade-off between geometric detail, surface cleanliness, and consistent, high-quality physically based texturing.
Methodological Contributions
The pipeline consists of two sequential stages:
1. Detailed Shape Generation (LATTICE Foundation Model):
- LATTICE is a 10B-parameter diffusion model trained on large-scale, high-quality 3D datasets, capable of generating 3D meshes with precise correspondence to conditioning images.
- The model leverages both single- and multi-view inputs, demonstrating robust generalization to complex object classes.
- Key improvements include preserving sharp edges while keeping surfaces globally smooth, even for objects with intricate geometry, which narrows the perceptual gap between generated and hand-crafted models.
- Practical acceleration is achieved through classifier-free guidance and distillation techniques, reducing inference time without loss of mesh fidelity (a minimal guidance sketch follows this list).
2. Physically-Based Texture Generation:
- Texture synthesis is extended to multi-channel PBR material maps, outputting albedo, roughness, and metallic channels concurrently with strong spatial and semantic alignment.
- The model introduces a dual-channel cross-attention mechanism, aligning basecolor-driven semantic cues across PBR maps while ensuring per-channel independence in value generation. The reference attention mask is shared across channels, enforcing spatial coherence (a sketch of this mask sharing follows the list).
- 3D-aware RoPE positional encoding is adopted to maintain cross-view texture consistency.
- The pipeline features a dual-phase, progressive resolution enhancement: initial training is performed with conventional 512×512 6-view images, followed by a “zoom-in” phase that enables finer detail acquisition at higher resolution (train-time random crops, inference up to 768×768).
- An illumination-invariant loss further enforces separation of intrinsic material properties from lighting effects (also sketched below).
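For readers unfamiliar with classifier-free guidance, its core is a weighted blend of conditional and unconditional denoiser outputs at each sampling step. Below is a minimal sketch of that blend; the `denoiser` callable, guidance scale `w`, and tensor shapes are illustrative placeholders, not Hunyuan3D 2.5's actual interface.

```python
import torch

def cfg_denoise(denoiser, x_t, t, cond, w: float = 5.0):
    """One classifier-free-guidance step: blend conditional and
    unconditional predictions. `denoiser`, `cond`, and `w` are
    illustrative placeholders, not the paper's actual interface."""
    eps_uncond = denoiser(x_t, t, cond=None)   # unconditional branch
    eps_cond = denoiser(x_t, t, cond=cond)     # image-conditioned branch
    # Guided prediction: extrapolate toward the conditional direction.
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy usage with a dummy denoiser over latent tokens.
dummy = lambda x, t, cond=None: torch.zeros_like(x) if cond is None else 0.1 * x
x = torch.randn(1, 4096, 64)                   # (batch, tokens, dim)
eps = cfg_denoise(dummy, x, t=torch.tensor([999]), cond=torch.randn(1, 257, 1024))
print(eps.shape)
```

Guidance distillation typically folds this two-pass computation into a single forward pass, which is where the reported speedup comes from.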
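The shared-attention-mask idea can be illustrated as follows: attention weights against the reference image are computed once from the basecolor branch and reused for the roughness and metallic branches, so all channels attend to the same spatial regions while keeping independent value projections. This is a minimal sketch under those assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedMaskPBRAttention(nn.Module):
    """Cross-attention where the attention map is computed once
    (from the basecolor/albedo branch) and shared across PBR
    channels; each channel keeps its own value projection.
    Illustrative sketch, not the paper's exact architecture."""
    def __init__(self, dim: int, channels=("albedo", "roughness", "metallic")):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # query from albedo tokens
        self.k = nn.Linear(dim, dim)   # key from reference tokens
        self.v = nn.ModuleDict({c: nn.Linear(dim, dim) for c in channels})
        self.scale = dim ** -0.5

    def forward(self, tokens: dict, ref: torch.Tensor) -> dict:
        # One attention map, driven by basecolor semantics.
        attn = F.softmax(self.q(tokens["albedo"]) @ self.k(ref).transpose(-2, -1)
                         * self.scale, dim=-1)
        # Per-channel values, same spatial attention.
        return {c: attn @ self.v[c](ref) for c in tokens}

# Toy usage: multi-view latent tokens attending to reference-image tokens.
dim, n, m = 64, 1536, 257
tokens = {c: torch.randn(1, n, dim) for c in ("albedo", "roughness", "metallic")}
out = SharedMaskPBRAttention(dim)(tokens, ref=torch.randn(1, m, dim))
print({c: tuple(o.shape) for c, o in out.items()})
```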
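The illumination-invariant loss can be read as a consistency constraint: renders of the same asset under different lighting should map to the same intrinsic material prediction. A minimal sketch of that reading follows, with a hypothetical `material_net` predictor; the paper's exact formulation may differ.

```python
import torch

def illumination_invariance_loss(material_net, render_a, render_b):
    """Penalize differences between material maps predicted from two
    renders of the same object under different lighting. `material_net`
    is a hypothetical predictor of (albedo, roughness, metallic);
    the paper's actual loss may be formulated differently."""
    mats_a = material_net(render_a)
    mats_b = material_net(render_b)
    return sum(torch.nn.functional.l1_loss(a, b) for a, b in zip(mats_a, mats_b))

# Toy usage: a dummy predictor splitting channels into three "maps".
dummy_net = lambda x: (x[:, :1], x[:, 1:2], x[:, 2:3])
a, b = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
print(illumination_invariance_loss(dummy_net, a, b))
```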
Quantitative and Qualitative Results
The system is benchmarked against a broad set of state-of-the-art open-source and commercial image- and text-guided 3D generation pipelines:
- Shape Generation: Hunyuan3D 2.5 reports the highest or highly competitive scores on ULIP-T/I and Uni3D-T/I, with notable gains on the text- and image-shape similarity metrics (e.g., ULIP-T: 0.07853 vs. 0.0771 for Hunyuan3D 2.0). The paper notes, however, that these standard metrics are near saturation, and that visual and perceptual evaluations reveal a wider gap favoring 2.5 over all comparators on real-world inputs.
- Texture Generation: On FID, CLIP-FID, CMMD, CLIP-I, and LPIPS, the method surpasses both text- and image-conditioned baselines, e.g. against Paint3D:
- FID (lower is better): 165.8 vs. 176.9
- CLIP-FID (lower is better): 23.97 vs. 26.86
- CLIP-I (higher is better): 0.9281 vs. 0.8871
- User Study: In direct pairwise comparisons against three leading commercial solutions, Hunyuan3D 2.5 achieved a 72% win rate on real-world input images, substantially ahead of the next-best method.
The model is the first open-source system to demonstrate robust PBR material generation with high consistency and detail, outperforming both RGB-only and closed PBR solutions in semantic alignment and visual realism.
Practical Implications
From a deployment perspective, Hunyuan3D 2.5 offers an appealing balance between scalability and asset quality. Notable aspects for practitioners include:
- Modularity: The two-stage approach allows for independent optimization, mixing and matching of geometry and texturing modules, and potential adaptation to other shape or texture priors.
- Acceleration: Few-step samplers (e.g., UniPC) and diffusion step distillation reduce inference time, enabling interactive or near-real-time workflows; a scheduler-swap sketch follows this list.
- Asset Pipeline Compatibility: Standard outputs (albedo, metallic-roughness, and normal maps on UV-unwrapped meshes) ensure compatibility with existing 3D content pipelines (game engines, film VFX, VR, etc.).
- Generalization: High-fidelity generation is sustained across in-the-wild image domains and diverse object categories, suggesting robust feature representations.
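To illustrate the acceleration point above: in diffusers-style pipelines, a few-step sampler such as UniPC can usually be swapped in with a couple of lines. The model identifier below is a placeholder, and whether Hunyuan3D 2.5 weights are distributed in this form is an assumption.

```python
from diffusers import DiffusionPipeline, UniPCMultistepScheduler

# Placeholder model id; actual Hunyuan3D 2.5 packaging may differ.
pipe = DiffusionPipeline.from_pretrained("your-org/your-3d-texture-model")
# Swap in the UniPC multistep sampler to cut the number of denoising steps.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
result = pipe(prompt="a ceramic teapot", num_inference_steps=12)
```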
Theoretical and Future Directions
Hunyuan3D 2.5 demonstrates that diffusion-based generative models, when scaled in both data and model size, can close the quality gap with artisan-crafted assets, for both shape and physically based materials. Noteworthy theoretical and research implications include:
- Scalability of Diffusion Models: The monotonic gains observed with scale in LATTICE reinforce the diffusion paradigm for high-dimensional structured data, though dataset curation remains pivotal.
- Texture-Geometry Coupling: The dual-phase progressive resolution strategy provides a tractable recipe for training high-resolution, geometry-aware generators without prohibitive memory or compute costs (see the cropping sketch after this list).
- Multi-Channel Attention Techniques: The attention-mask sharing mechanism for cross-material spatial alignment may generalize to other multimodal or multi-output generation problems.
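One way to read the dual-phase recipe: after low-resolution pretraining, the model is fine-tuned on random crops of higher-resolution renders, with the same crop window applied to every view so cross-view correspondence is preserved. The sketch below illustrates that cropping step under those assumptions; the paper's implementation details may differ.

```python
import torch

def shared_random_crop(views: torch.Tensor, crop: int = 512):
    """Crop the same window out of every view of a multi-view batch,
    so cross-view pixel correspondences survive the zoom-in phase.
    views: (batch, n_views, channels, H, W). Illustrative sketch only."""
    _, _, _, h, w = views.shape
    top = torch.randint(0, h - crop + 1, (1,)).item()
    left = torch.randint(0, w - crop + 1, (1,)).item()
    patches = views[..., top:top + crop, left:left + crop]
    return patches, (top, left)   # offsets could feed positional encodings

# Toy usage: 6 views rendered at 768x768, cropped to 512x512 patches.
views = torch.randn(1, 6, 3, 768, 768)
patches, offsets = shared_random_crop(views)
print(patches.shape, offsets)
```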
Future directions may include:
- Integration of end-to-end training for joint optimization of shape and texture,
- Further acceleration via consistency or adversarial diffusion distillation,
- Adaptation for controllable or editable asset generation,
- Extended support for dynamic or deformable object generation (e.g., characters, articulated assets),
- Inverse rendering and relighting capabilities for enhanced asset realism.
Conclusion
Hunyuan3D 2.5 sets a new reference point for automated, production-ready 3D asset generation using scalable diffusion architectures. Its methodological innovations deliver leading performance in both objective and subjective evaluations and demonstrate the viability of high-fidelity generative synthesis in practical asset-creation pipelines. The architectural and methodological choices detailed in this work are likely to inform future research and industrial practice on large-scale 3D generative models.