- The paper presents a two-stage generative pipeline decoupling global structure from fine detail using diffusion transformers and voxel-query refinement.
- It introduces a CUDA-parallel, watertight remeshing process that robustly handles geometrically challenging public 3D datasets.
- Experimental results demonstrate superior geometric fidelity and scaling, rivaling both open-source and commercial 3D generation systems.
UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement
Motivation and Positioning
UltraShape 1.0 introduces a scalable, high-fidelity 3D generative framework designed to address persistent challenges in current 3D generation pipelines: non-uniform, low-quality data; the lack of robust watertight mesh processing; limited geometric detail; and poor scalability in high-resolution geometry generation. The framework departs from prior paradigms by adopting a two-stage pipeline that decouples global structural synthesis from local geometric detail refinement. This is underpinned by a comprehensive data curation and watertightening pipeline and a refinement-centric generative process that makes novel use of structured voxel queries with rotary positional encoding.
Scalable Data Processing and Watertightening
The geometric quality of public 3D datasets such as Objaverse has historically hindered the robustness of 3D generative models. UltraShape 1.0 addresses this with a CUDA-parallel, sparse voxel infrastructure capable of watertight remeshing at 2048³ resolution, robustly closing holes and regularizing open or thin-shelled surfaces prior to SDF-based shape modeling. This approach outperforms UDF-based, visibility-check, and flood-fill remeshing strategies: it resolves topological ambiguities with minimal geometric artifacts and does not suffer from leaking, internal hollowing, or excessive smoothing of fine features.
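The core idea can be illustrated on a small dense grid: seal holes in a voxelized surface, flood-fill the exterior, and sign the distance field accordingly. This is only a toy analogue, assuming a dense numpy grid and a morphological-closing seal radius chosen for illustration; the paper's pipeline operates CUDA-parallel over sparse voxels at 2048³ and avoids the failure modes of plain flood fill.

```python
import numpy as np
from scipy import ndimage

def watertight_sdf(surface_occ, seal_iters=2):
    """Toy dense-grid sketch of watertightening.

    Steps: (1) seal small holes in the voxelized surface via morphological
    closing, (2) flood-fill exterior free space from a grid corner (assumed
    empty), (3) sign the Euclidean distance transform: positive outside,
    negative inside or on the shell. `seal_iters` is an illustrative knob.
    """
    sealed = ndimage.binary_closing(
        surface_occ, structure=np.ones((3, 3, 3), bool), iterations=seal_iters)
    free = ~sealed
    labels, _ = ndimage.label(free)
    outside = labels == labels[0, 0, 0]          # component touching the corner
    dist = ndimage.distance_transform_edt(~surface_occ)
    return np.where(outside, dist, -dist)

# Example: a spherical shell with a punched hole is still signed correctly.
x, y, z = np.mgrid[:32, :32, :32]
shell = np.abs(np.sqrt((x - 16)**2 + (y - 16)**2 + (z - 16)**2) - 10) < 1.5
shell[14:18, 14:18, 24:] = False                 # open a small hole at the top
sdf = watertight_sdf(shell)
```

After sealing, the cavity center is classified as interior (negative SDF) even though the input shell was open, which is exactly the property the SDF-based shape modeling stage requires.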
Figure 1: Comparison of the proposed watertightening method against alternatives; UltraShape 1.0 yields consistently watertight, artifact-free surfaces retaining high geometric fidelity.
A multi-stage filtering pipeline further refines the raw object set: vision-language model (VLM)–based filtering removes trivial or misoriented assets; pose normalization ensures the geometric alignment critical for transformer-based generation; and geometry filters prune models with thin shells or excessive disconnected components, identified via reconstruction-based diagnostics. Manual inspection finalizes a training corpus of approximately 120K high-quality, watertight, pose-normalized shapes.
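The disconnected-component check can be sketched on voxelized assets: count connected components above a noise threshold and reject assets whose count is excessive. The `min_frac` speck threshold and the idea of thresholding the count are illustrative assumptions; the paper does not publish its exact filter parameters.

```python
import numpy as np
from scipy import ndimage

def count_significant_components(occ, min_frac=1e-4):
    """Count connected components of a voxelized asset, ignoring speck noise.

    A geometry filter in the style of the paper's 'excessive disconnected
    components' check could reject assets where this count is too high.
    `min_frac` (fraction of grid voxels below which a component is treated
    as noise) is an assumed value for illustration.
    """
    labels, _ = ndimage.label(occ)
    sizes = np.bincount(labels.ravel())[1:]      # drop the background bin
    return int((sizes >= min_frac * occ.size).sum())

# Example: one solid cube, then two more far-apart cubes added.
occ = np.zeros((32, 32, 32), bool)
occ[2:6, 2:6, 2:6] = True
single = count_significant_components(occ)       # 1 component
occ[20:24, 20:24, 20:24] = True
occ[2:6, 20:24, 2:6] = True
triple = count_significant_components(occ)       # 3 components
```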
Two-Stage Coarse-to-Fine 3D Generation
UltraShape 1.0's generative pipeline is characterized by an explicit architectural decoupling of global structure from fine detail synthesis. The first stage leverages a diffusion transformer (DiT) trained on vector set representations, specializing in global structural priors and providing robust, low-frequency shape anchors. The second stage implements a voxel-query–conditioned diffusion refinement process, driven by structured latent queries derived from the Stage-1 mesh and augmented with rotary positional encodings to inject explicit spatial localization.
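The rotary encoding of voxel-query positions can be sketched as an axis-wise RoPE over 3D coordinates: the feature vector is split into three axis groups, and each group is rotated by angles proportional to the query's coordinate along that axis. The exact channel split and frequency layout below are assumptions for illustration; the paper only states that rotary positional encodings inject spatial localization into the structured queries.

```python
import numpy as np

def rope_3d(feats, coords, base=10000.0):
    """Axis-wise rotary positional encoding for voxel-query features.

    feats:  (N, D) query features, D divisible by 6 (2 channels per rotation
            pair, 3 spatial axes).
    coords: (N, 3) voxel coordinates of each query.
    Layout (one contiguous channel group per axis) is a hypothetical choice.
    """
    n, d = feats.shape
    assert d % 6 == 0
    d_axis = d // 3
    half = d_axis // 2
    out = np.empty_like(feats)
    for a in range(3):
        f = feats[:, a * d_axis:(a + 1) * d_axis]
        freqs = base ** (-np.arange(half) / half)       # per-pair frequencies
        ang = coords[:, a:a + 1] * freqs[None, :]       # (N, half) angles
        cos, sin = np.cos(ang), np.sin(ang)
        x1, x2 = f[:, :half], f[:, half:]
        out[:, a * d_axis:(a + 1) * d_axis] = np.concatenate(
            [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return out
```

As with 1D RoPE, the rotation preserves feature norms, and attention dot products between two encoded queries depend only on their relative voxel offset, which is what makes the localization explicit rather than absolute.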
Figure 2: System-level overview of UltraShape 1.0, illustrating dual-stage generation with latent encoding, structured queries, and marching cubes–based mesh extraction.
The refinement process is not constrained to surface point queries but learns to operate over volumetric regions, enabling the VAE to synthesize SDF values in non-surface vicinities, which significantly enhances the model's capacity for geometric detail, particularly at higher token counts and resolutions. This explicit spatial decoupling circumvents the convergence difficulty and smoothing artifacts endemic to vector set methods, which otherwise must entangle global spatial and local feature synthesis in a high-entropy latent space.
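Training on volumetric rather than surface-only queries amounts to a sampling choice: mix near-surface perturbed points with uniform points drawn from the whole volume, so the decoder is supervised on SDF values away from the surface. The 4:1 ratio and perturbation scale below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_sdf_queries(surface_pts, n_near=1024, n_uniform=256,
                       sigma=0.02, rng=None):
    """Draw SDF query points for volumetric supervision.

    surface_pts: (M, 3) points on the mesh surface in a [-1, 1]^3 volume.
    Returns (n_near + n_uniform, 3) queries: near-surface jittered samples
    followed by uniform volume samples. Ratio and `sigma` are assumptions.
    """
    rng = rng or np.random.default_rng(0)
    idx = rng.integers(0, len(surface_pts), size=n_near)
    near = surface_pts[idx] + rng.normal(scale=sigma, size=(n_near, 3))
    uniform = rng.uniform(-1.0, 1.0, size=(n_uniform, 3))
    return np.concatenate([near, uniform], axis=0)

# Example: queries around a unit-sphere point cloud scaled into the volume.
rng = np.random.default_rng(1)
pts = rng.normal(size=(2048, 3))
pts = 0.8 * pts / np.linalg.norm(pts, axis=1, keepdims=True)
queries = sample_sdf_queries(pts)
```

The uniform tail is what lets the VAE predict SDF values in non-surface vicinities instead of collapsing to a surface-only regressor.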
Analysis of Mesh Refinement and Scaling
Evaluation demonstrates that the proposed two-stage design achieves sharper details and markedly improved geometric fidelity relative to the unrefined, coarse outputs typical of first-stage-only systems.
Figure 3: Side-by-side mesh comparison; right column shows the substantial geometric enhancement conferred by voxel-based Stage-2 refinement over initial coarse geometry.
Test-time scaling experiments corroborate that both the VAE and the DiT in UltraShape 1.0 generalize robustly as the number of latent tokens grows, a property lacking in mainstream vector set or surface-only frameworks. With increased latent capacity, the model consistently reconstructs shapes with greater detail and surface clarity, avoiding overfitting or loss of local structure.
Competitive Comparisons
Qualitative results illustrate UltraShape 1.0's superiority to leading open-source models (e.g., CLAY, Hunyuan3D, LATTICE) and even commercial 3D generation systems in terms of geometric detail, surface regularity, and input image–shape consistency.
Figure 4: Comparison to open-source SOTA; UltraShape 1.0 demonstrates superior localization of structure and preservation of fine-scale details.
Figure 5: Additional qualitative comparison against top-tier open-source models.
Notably, these results are achieved using only public data and modest compute resources, challenging the necessity of proprietary datasets or extreme-scale training regimes for high-quality 3D geometry synthesis.
When benchmarked against commercial-grade systems, UltraShape 1.0 achieves nearly indistinguishable quality, exhibiting high-frequency detail and global shape alignment competitive across virtually all tested input modalities.
Figure 6: Qualitative comparison to commercial 3D generation systems, confirming geometric fidelity and image-consistency parity.
Stylization and Conditioning
UltraShape 1.0 supports training-free, dual-stage stylization: the coarse global shape can be influenced by one image, while fine details are sculpted according to another, allowing nuanced control and flexible asset authoring with zero retraining. This is enabled by the fact that the refinement stage is strictly conditioned on the provided voxel tokens, abstracting away global spatial uncertainties.
Theoretical Implications and Future Directions
The results highlight that explicit spatial structuring within latent token space, combined with high-quality, semantically consistent training data, is critical for scalable and high-fidelity 3D generation. The two-stage generation paradigm, especially with voxel-based queries and RoPE-based encoding, offers a blueprint for overcoming the memory and convergence bottlenecks of previous vector/point set–based generative methods.
UltraShape 1.0 suggests promising directions for:
- Enhancing test-time geometry upscaling via token extrapolation, potentially pushing geometric fidelity toward photorealistic engineering or VFX applications without retraining.
- Modular stylization pipelines, where stage-specific conditioning enables controlled shape/content disentanglement for downstream editing or authoring.
- Further analysis of the interplay between data curation, positional encoding, and latent structure in large-scale 3D generative models.
Conclusion
UltraShape 1.0 establishes a robust, scalable benchmark for high-fidelity 3D geometry generation within a resource-efficient and publicly reproducible framework. The integration of a CUDA-efficient watertightening pipeline, stringent data curation, and a novel two-stage coarse-to-fine generative strategy with structured voxel queries distinguishes it from prior art and demonstrates that scalable, SOTA-quality 3D assets can be generated from purely open data and non-proprietary toolchains. This work lays a solid foundation for future development of practical, open generative 3D toolchains and provides significant insights for scaling geometric detail in large generative models (2512.21185).