
Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details (2506.16504v1)

Published 19 Jun 2025 in cs.CV and cs.AI

Abstract: In this report, we present Hunyuan3D 2.5, a robust suite of 3D diffusion models aimed at generating high-fidelity and detailed textured 3D assets. Hunyuan3D 2.5 follows the two-stage pipeline of its previous version, Hunyuan3D 2.0, while demonstrating substantial advancements in both shape and texture generation. In terms of shape generation, we introduce a new shape foundation model -- LATTICE, which is trained with scaled high-quality datasets, model size, and compute. Our largest model reaches 10B parameters and generates sharp and detailed 3D shapes with precise image-3D following while keeping mesh surfaces clean and smooth, significantly closing the gap between generated and handcrafted 3D shapes. In terms of texture generation, it is upgraded with physically-based rendering (PBR) via a novel multi-view architecture extended from the Hunyuan3D 2.0 Paint model. Our extensive evaluation shows that Hunyuan3D 2.5 significantly outperforms previous methods in both shape and end-to-end texture generation.

Summary

  • The paper introduces a two-stage diffusion-based pipeline that enhances both detailed shape generation and multi-channel PBR texture synthesis.
  • It leverages a 10B-parameter model with classifier-free guidance, achieving sharp geometric precision and smooth surface continuity even for complex objects.
  • Quantitative and qualitative results, including a 72% win rate in user studies, demonstrate its superior performance over state-of-the-art 3D generation methods.

Hunyuan3D 2.5: High-Fidelity 3D Asset Generation via Scalable Diffusion and PBR Texturing

Hunyuan3D 2.5 introduces a comprehensive two-stage diffusion-based pipeline for texture-rich, high-detail 3D asset generation from one or several input images. The system advances both geometry and PBR-based material synthesis, with architectural and computational improvements that directly address the main constraints of previous work—namely the trade-off between geometric detail, surface cleanliness, and consistent, high-quality physically-based texturing.

Methodological Contributions

The pipeline consists of two sequential stages:

1. Detailed Shape Generation (LATTICE Foundation Model):

  • LATTICE is a 10B-parameter diffusion model trained on large-scale, high-quality 3D datasets, capable of generating 3D meshes with precise correspondence to conditioning images.
  • The model leverages both single- and multi-view inputs, demonstrating robust generalization to complex object classes.
  • Key improvements include the combination of sharp edge preservation and globally smooth surfaces—even for objects with intricate geometry—effectively reducing the perceptual gap between synthesis and hand-crafted models.
  • Practical acceleration is achieved through classifier-free guidance and distillation techniques, reducing inference time without loss of mesh fidelity (a minimal guidance sketch follows the texture-generation list below).

2. Physically-Based Texture Generation:

  • Texture synthesis is extended to multi-channel PBR material maps, outputting albedo, roughness, and metallic channels concurrently with strong spatial and semantic alignment.
  • The model introduces a dual-channel cross-attention mechanism, aligning basecolor-driven semantic cues across PBR maps while keeping per-channel value generation independent. The reference attention mask is shared across channels, enforcing spatial coherence (see the attention sketch after this list).
  • A 3D-aware rotary position embedding (RoPE) is adopted to maintain cross-view texture consistency.
  • The pipeline features a dual-phase, progressive resolution enhancement: initial training is performed with conventional 512×512 6-view images, followed by a “zoom-in” phase that enables finer detail acquisition at higher resolution (train-time random crops, inference up to 768×768).
  • An illumination-invariant loss further enforces the separation of intrinsic material properties from lighting effects (one possible form is sketched below).
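
To make the guidance mechanism concrete, the following is a minimal sketch of a single classifier-free-guidance denoising step. The `model` callable, the conditioning tensors, and the guidance scale are illustrative assumptions, not the LATTICE implementation.

```python
import torch

@torch.no_grad()
def cfg_denoise_step(model, x_t, t, cond, null_cond, guidance_scale=5.0):
    """One denoising step with classifier-free guidance (CFG).

    The network is queried twice: once with the image condition and once with
    a "null" condition; the two noise predictions are then blended. Guidance
    distillation folds this blend into a single forward pass.
    """
    noise_cond = model(x_t, t, cond)         # prediction given the condition
    noise_uncond = model(x_t, t, null_cond)  # prediction with the condition dropped
    # Push the estimate away from the unconditional branch.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```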
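
The dual-channel cross-attention can be illustrated with a small PyTorch module: a single attention map is computed from the basecolor stream and reused to aggregate channel-specific values, so the albedo and metallic-roughness maps attend to the same reference-image regions while keeping independent value projections. This is a sketch under assumed shapes and layer names, not the paper's code.

```python
import torch
import torch.nn as nn

class SharedMaskCrossAttention(nn.Module):
    """Cross-attention whose attention map is computed once from the basecolor
    (albedo) queries and shared with the metallic-roughness stream."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_albedo = nn.Linear(dim, dim)
        self.k_ref = nn.Linear(dim, dim)
        # Separate value/output projections keep per-channel values independent.
        self.v_albedo = nn.Linear(dim, dim)
        self.v_mr = nn.Linear(dim, dim)
        self.out_albedo = nn.Linear(dim, dim)
        self.out_mr = nn.Linear(dim, dim)

    def _split(self, x):
        b, n, _ = x.shape
        return x.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def _merge(self, x):
        b, h, n, d = x.shape
        return x.transpose(1, 2).reshape(b, n, h * d)

    def forward(self, albedo_tokens, mr_tokens, ref_tokens):
        q = self._split(self.q_albedo(albedo_tokens))
        k = self._split(self.k_ref(ref_tokens))
        # One attention map, derived from the albedo queries only.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        # The same spatial weights aggregate channel-specific values.
        out_albedo = self._merge(attn @ self._split(self.v_albedo(ref_tokens)))
        out_mr = self._merge(attn @ self._split(self.v_mr(ref_tokens)))
        return self.out_albedo(out_albedo), self.out_mr(out_mr)
```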
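
One plausible form of the illumination-invariant objective is sketched below; the paired-lighting formulation and the `material_net` callable are assumptions for illustration, not the paper's exact loss.

```python
import torch.nn.functional as F

def illumination_invariance_loss(material_net, render_light_a, render_light_b):
    """Penalize lighting leakage: the same surface rendered under two lighting
    conditions should map to the same intrinsic material prediction."""
    albedo_a, mr_a = material_net(render_light_a)
    albedo_b, mr_b = material_net(render_light_b)
    # Intrinsic channels (albedo, metallic-roughness) must agree across lighting.
    return F.l1_loss(albedo_a, albedo_b) + F.l1_loss(mr_a, mr_b)
```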

Quantitative and Qualitative Results

The system is benchmarked against a broad set of state-of-the-art open-source and commercial image- and text-guided 3D generation pipelines:

  • Shape Generation: Hunyuan3D 2.5 reports the highest or competitive scores on ULIP-T/I and Uni3D-T/I, with notable advancement in the text- and image-shape similarity metrics (e.g., ULIP-T: 0.07853 vs. 0.0771 for Hunyuan3D 2.0). However, the paper notes that standard metrics may saturate and visual/perceptual evaluations reveal a wider gap favoring 2.5 over all comparators in real-world scenarios.
  • Texture Generation: On FID, CLIP-FID, CMMD, CLIP-I, and LPIPS, the method surpasses both text- and image-conditioned baselines:
    • FID: 165.8 vs. 176.9 (Paint3D)
    • CLIP-FID: 23.97 vs. 26.86 (Paint3D)
    • CLIP-I: 0.9281 vs. 0.8871 (Paint3D)
  • User Study: In direct pairwise comparisons against three leading commercial solutions on real-world input images, Hunyuan3D 2.5 achieved a 72% win rate, substantially ahead of the next-best method.

The model is the first to demonstrate robust, open-source PBR material generation with high consistency and detail, outperforming both RGB-only and closed-source PBR solutions in semantic alignment and visual realism.

Practical Implications

From a deployment perspective, Hunyuan3D 2.5 offers an appealing balance between scalability and asset quality. Notable aspects for practitioners include:

  • Modularity: The two-stage approach allows for independent optimization, mixing and matching of geometry and texturing modules, and potential adaptation to other shape or texture priors.
  • Acceleration: Fast multistep samplers (e.g., UniPC) together with guidance and step distillation reduce inference time, enabling interactive or near-real-time workflows (a sampler-swap sketch follows this list).
  • Asset Pipeline Compatibility: The use of standard mesh and PBR asset outputs (albedo/metallic-roughness/normal maps, UV-unwrapped meshes) ensures compatibility with existing 3D content pipelines (game engines, film VFX, VR, etc.); a packaging sketch also follows this list.
  • Generalization: High-fidelity generation is sustained across in-the-wild image domains and diverse object categories, suggesting robust feature representations.
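
As an illustration of the few-step sampling mentioned in the acceleration point, the snippet below swaps a UniPC scheduler into a generic Hugging Face diffusers pipeline. The checkpoint identifier is a placeholder, and the released Hunyuan3D 2.5 inference API may differ from this sketch.

```python
import torch
from diffusers import DiffusionPipeline, UniPCMultistepScheduler

# Placeholder checkpoint id; substitute the pipeline you actually use.
pipe = DiffusionPipeline.from_pretrained(
    "some-org/some-diffusion-checkpoint",
    torch_dtype=torch.float16,
).to("cuda")

# UniPC typically reaches comparable quality in far fewer denoising steps
# than the default scheduler.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

result = pipe("a reference render of a textured 3D asset", num_inference_steps=20)
```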
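
For the asset-pipeline point, the following hedged sketch packages a UV-unwrapped mesh and generated PBR maps into a glTF binary with the trimesh library; the file names are placeholders and the official Hunyuan3D 2.5 export path may differ.

```python
import trimesh
from PIL import Image

# Assumes the generated mesh already carries UV coordinates.
mesh = trimesh.load("generated_shape.obj", process=False)

material = trimesh.visual.material.PBRMaterial(
    baseColorTexture=Image.open("albedo.png"),
    metallicRoughnessTexture=Image.open("metallic_roughness.png"),
    normalTexture=Image.open("normal.png"),
)
mesh.visual = trimesh.visual.TextureVisuals(uv=mesh.visual.uv, material=material)

# glTF binary output is directly consumable by game engines and DCC tools.
mesh.export("asset.glb")
```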

Theoretical and Future Directions

Hunyuan3D 2.5 demonstrates that diffusion-based generative models, when scaled in both data and model size, can close the quality gap with artisan-crafted assets—critically, for both shape and physically-based material aspects. Noteworthy theoretical and research implications include:

  • Scalability of Diffusion Models: The monotonic gains observed with scale in LATTICE reinforce the diffusion paradigm for high-dimensional structured data, though dataset curation remains pivotal.
  • Texture-Geometry Coupling: The dual-phase progressive resolution strategy provides a tractable recipe for training high-resolution, geometry-aware generators without incurring prohibitive memory or compute costs.
  • Multi-Channel Attention Techniques: The attention-mask sharing mechanism for cross-material spatial alignment may generalize to other multimodal or multi-output generation problems.

Future directions may include:

  • Integration of end-to-end training for joint optimization of shape and texture,
  • Further acceleration via consistency or adversarial diffusion distillation,
  • Adaptation for controllable or editable asset generation,
  • Extended support for dynamic or deformable object generation (e.g., characters, articulated assets),
  • Inverse rendering and relighting capabilities for enhanced asset realism.

Conclusion

Hunyuan3D 2.5 sets a new reference point for automated, production-ready 3D asset generation using scalable diffusion architectures. Its methodological innovations lead to domain-competitive performance in both objective and subjective evaluations and demonstrate the viability of high-fidelity generative synthesis in practical asset creation pipelines. The architectural and methodological choices detailed in this work are likely to inform future research and industrial practice around large-scale 3D generative models.