Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention (2507.17745v3)

Published 23 Jul 2025 in cs.CV and cs.AI

Abstract: Recent advances in sparse voxel representations have significantly improved the quality of 3D content generation, enabling high-resolution modeling with fine-grained geometry. However, existing frameworks suffer from severe computational inefficiencies due to the quadratic complexity of attention mechanisms in their two-stage diffusion pipelines. In this work, we propose Ultra3D, an efficient 3D generation framework that significantly accelerates sparse voxel modeling without compromising quality. Our method leverages the compact VecSet representation to efficiently generate a coarse object layout in the first stage, reducing token count and accelerating voxel coordinate prediction. To refine per-voxel latent features in the second stage, we introduce Part Attention, a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions. This design preserves structural continuity while avoiding unnecessary global attention, achieving up to 6.7x speed-up in latent generation. To support this mechanism, we construct a scalable part annotation pipeline that converts raw meshes into part-labeled sparse voxels. Extensive experiments demonstrate that Ultra3D supports high-resolution 3D generation at 1024 resolution and achieves state-of-the-art performance in both visual fidelity and user preference.

Summary

The paper presents Ultra3D, an efficient two-stage framework that uses localized Part Attention to accelerate high-resolution 3D generation.
It generates a coarse layout with the compact VecSet representation, then refines per-voxel latent features with Part Attention to preserve semantic coherence at reduced computational cost.
Experiments show higher visual fidelity and richer surface details than state-of-the-art methods.

Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention

The paper "Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention" presents a two-stage framework that improves both the efficiency and the quality of 3D generation. Its localized attention mechanism eases the attention bottleneck that makes conventional sparse voxel pipelines expensive.

Introduction to Ultra3D

Ultra3D builds on recent advances in sparse voxel representations, which enable high-resolution modeling with fine geometric detail. Existing frameworks, however, are computationally expensive because attention cost grows quadratically with the number of tokens in their two-stage diffusion pipelines. Ultra3D first uses the compact VecSet representation to generate a coarse object layout, which reduces the token count and speeds up voxel coordinate prediction. It then refines per-voxel latent features with Part Attention, a geometry-aware localized attention mechanism. By avoiding unnecessary global attention, Part Attention achieves up to a 6.7× speed-up in latent generation.
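To see why restricting attention pays off, the sketch below counts query-key scores, the quantity that grows quadratically with token count. It assumes single-head attention and uses made-up part sizes, not numbers from the paper. Real speed-ups such as the reported 6.7× also depend on kernels, memory traffic, and how tokens are distributed across parts, so this count is only a rough guide.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Full global attention: every query scores every key, so cost is N^2."""
    return n_tokens * n_tokens


def part_attention_pairs(part_sizes: list[int]) -> int:
    """Part-restricted attention: each query scores only keys in its own part."""
    return sum(n * n for n in part_sizes)


# Hypothetical token counts per part (illustrative only, not from the paper).
parts = [4000, 3000, 2000, 1000]
full = full_attention_pairs(sum(parts))
local = part_attention_pairs(parts)
print(full, local, full / local)
```

With these sizes, full attention computes 100,000,000 scores while part attention computes 30,000,000, roughly 3.3× fewer. If a single part contained every token, the two counts would coincide, so the savings come from splitting the object into several parts.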

Part Attention Mechanism

The central component of Ultra3D is Part Attention, which computes attention independently within each semantically consistent part group. Instead of full global attention over all voxel tokens, each token attends only to tokens in the same part, so the attention groups follow the object's geometric part boundaries. This preserves semantic structure and saves computation by skipping attention between unrelated parts.
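A minimal pure-Python sketch of the idea, assuming single-head scaled dot-product attention and one integer part ID per token. The function name `part_attention` and its arguments are hypothetical, not taken from the paper's code. Each query scores only the keys in its own part, which is equivalent to full attention with every cross-part score masked out.

```python
import math


def part_attention(q, k, v, part_ids):
    """Single-head scaled dot-product attention in which token i attends
    only to tokens j with part_ids[j] == part_ids[i]."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Keys visible to query i: the tokens that share its part ID.
        idx = [j for j, p in enumerate(part_ids) if p == part_ids[i]]
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in idx]
        # Softmax over the visible keys only (max subtracted for numerical stability).
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted sum of the visible value vectors.
        row = [0.0] * len(v[0])
        for w, j in zip(weights, idx):
            for c in range(len(row)):
                row[c] += (w / total) * v[j][c]
        out.append(row)
    return out


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(part_attention(q, k, v, part_ids=[0, 0, 1]))
```

With `part_ids = [0, 0, 1]`, the third token is alone in its part, so its output is exactly its own value vector `[5.0, 6.0]`, and tokens 0 and 1 never see it. A practical implementation would more likely gather each part's tokens and run dense attention per group than loop per query; the loop here only keeps the masking rule explicit.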

Figure 1: A comparison of attention mechanisms shows that fixed 3D Window Attention often misaligns with semantic boundaries and causes style inconsistencies; Part Attention avoids this by grouping tokens along part boundaries.
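The misalignment in Figure 1 can be reproduced on a toy example. The sketch below uses hypothetical helper names and a made-up strip of eight voxels split into two parts. It groups the voxels once by fixed cubic windows and once by part label, then reports the windows that straddle a part boundary. Tokens in such a window attend across two parts; grouping by part label cannot mix parts by construction.

```python
from collections import defaultdict


def window_groups(coords, window):
    """Fixed 3D Window Attention grouping: voxels in the same cube share a group."""
    groups = defaultdict(list)
    for i, (x, y, z) in enumerate(coords):
        groups[(x // window, y // window, z // window)].append(i)
    return dict(groups)


def part_groups(part_ids):
    """Part Attention grouping: voxels with the same part label share a group."""
    groups = defaultdict(list)
    for i, p in enumerate(part_ids):
        groups[p].append(i)
    return dict(groups)


def mixed_windows(coords, part_ids, window):
    """Windows whose voxels come from more than one part."""
    return [key for key, members in window_groups(coords, window).items()
            if len({part_ids[i] for i in members}) > 1]


# Toy strip of 8 voxels along x; the part boundary sits between x = 2 and x = 3.
coords = [(x, 0, 0) for x in range(8)]
part_ids = [0 if x < 3 else 1 for x in range(8)]
print(mixed_windows(coords, part_ids, window=4))  # [(0, 0, 0)]
print(part_groups(part_ids))  # {0: [0, 1, 2], 1: [3, 4, 5, 6, 7]}
```

The first window covers x = 0 to 3 and therefore contains voxels from both parts, while the part-based groups split exactly at the boundary regardless of where it falls.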

Framework and Pipeline

The pipeline has two stages: sparse voxel generation using the VecSet representation, followed by per-voxel latent generation that refines the layout. To provide the part labels that Part Attention relies on, a scalable part annotation pipeline converts raw meshes into part-labeled sparse voxels.
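A minimal sketch of the final annotation step, turning part-labeled geometry into part-labeled sparse voxels, under several assumptions not taken from the paper: surface points sampled from the mesh already carry part labels from some upstream segmentation (not shown), the mesh is normalized to the cube [-1, 1]^3, and a voxel touched by several parts takes the majority label. The function name and the labels "leg" and "seat" are hypothetical.

```python
from collections import Counter, defaultdict


def voxelize_with_parts(points, labels, resolution):
    """Map part-labeled surface points in [-1, 1]^3 to sparse voxels,
    each carrying a single part label."""
    votes = defaultdict(Counter)
    for (x, y, z), label in zip(points, labels):
        # Scale [-1, 1] to [0, resolution) and clamp the upper edge into the last cell.
        idx = tuple(min(int((c + 1.0) / 2.0 * resolution), resolution - 1) for c in (x, y, z))
        votes[idx][label] += 1
    # Keep only occupied voxels; resolve mixed voxels by majority vote
    # (ties go to the label seen first in that voxel).
    return {idx: counter.most_common(1)[0][0] for idx, counter in votes.items()}


points = [(-0.9, -0.9, -0.9), (-0.85, -0.9, -0.9), (-0.88, -0.9, -0.9), (0.9, 0.9, 0.9)]
labels = ["leg", "leg", "seat", "seat"]
print(voxelize_with_parts(points, labels, resolution=4))
# {(0, 0, 0): 'leg', (3, 3, 3): 'seat'}
```

Only occupied voxels appear in the result, which is what makes the representation sparse: at resolution 1024 a dense grid has over a billion cells, while a surface occupies only a small fraction of them.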

Sparse Voxel Generation: The first stage uses the compact VecSet representation to generate a coarse mesh, which is then voxelized to obtain the sparse voxel layout. Because VecSet uses relatively few tokens, this stage reduces the token count and accelerates voxel coordinate prediction.
Sparse Latent Generation: The second stage generates per-voxel latent features with Part Attention, which preserves semantic coherence while avoiding full global attention, making high-resolution generation affordable.
Figure 2: Ultra3D's pipeline generates a sparse voxel layout via VecSet and then refines it with per-voxel latent generation, with Part Attention at its core.

Results and Comparisons

The paper reports state-of-the-art visual fidelity and user preference for Ultra3D, together with improved efficiency. Qualitative comparisons show richer surface details than prior methods, and Part Attention matches the generation quality of full global attention at a substantially lower computational cost.

Figure 3: Ultra3D produces higher-fidelity results with richer surface details than prior methods, closely matching the input images.

Implications and Future Directions

Part Attention addresses a key efficiency bottleneck in sparse voxel-based 3D generation. Because Ultra3D scales to high resolutions at reduced computational cost, it is promising for gaming, AR/VR, and other domains that require detailed 3D content. Future work could extend Part Attention to more complex geometric structures or combine it with emerging part-annotated 3D datasets to further improve 3D modeling.

Conclusion

Ultra3D combines efficiency and high fidelity through its Part Attention mechanism, reporting state-of-the-art visual fidelity and user preference while generating at 1024 resolution. It also points toward scalable, semantically consistent 3D modeling for the growing demands of digital content creation.