UltraShape 1.0: 3D Diffusion for Mesh Synthesis
- UltraShape 1.0 is a scalable 3D diffusion framework that generates high-fidelity, watertight meshes using a two-stage process.
- It decouples global structure synthesis from local geometric refinement, ensuring topological robustness and fine-detail accuracy.
- The method, trained on curated public datasets, achieves competitive performance, with lower Chamfer Distance and higher IoU than open-source baselines, targeting real-world applications.
UltraShape 1.0 is a scalable 3D diffusion framework designed for high-fidelity geometry generation. It achieves competitive quantitative and qualitative performance in mesh synthesis by combining rigorous watertight data curation with a two-stage generation pipeline: an initial coarse structure synthesis followed by fine-grained geometric refinement. UltraShape 1.0 is trained exclusively on publicly available datasets, and its architecture decouples spatial localization from local surface synthesis, enabling scalable, detail-preserving outputs suitable for downstream applications such as physics simulation, CAD, and AR/VR (Jia et al., 24 Dec 2025).
1. Motivation and Architectural Overview
UltraShape 1.0 addresses three principal challenges in 3D generative modeling: limited data availability and irregularity, topological robustness, and scalability to high resolutions. Publicly available 3D meshes are often scarce and noisy, containing cracks, self-intersections, and non-watertight regions that inhibit downstream usability. Achieving topological robustness, specifically watertightness, is critical for simulation, manufacturing, and visualization use cases. Prior remeshing approaches, including those based on unsigned distance functions (UDFs), visibility checks, or flood-fill, frequently introduce artifacts or fail to handle intricate thin regions.
UltraShape 1.0 employs a rigorously designed data pipeline to ensure watertight geometric validity and uses a two-stage neural architecture to decouple global structure estimation from local detail synthesis. The first stage utilizes a vector-set VAE representation and a DiT-style diffusion transformer to capture overall shape, generating a low-frequency signed distance field (SDF). The second stage focuses on refining local geometry at fixed voxel queries, using rotary positional encoding (RoPE) to inject spatial awareness and facilitate the generation of high-frequency geometric detail.
2. Two-Stage Diffusion Pipeline and Mathematical Formulation
2.1 Diffusion Model Fundamentals
UltraShape 1.0 adopts the denoising diffusion probabilistic model (DDPM) framework. Let $x_0$ denote the initial data, which may be either global vector tokens or voxel latent tokens. The forward process adds Gaussian noise incrementally:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),$$

with the closed form:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right), \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).$$

The reverse denoising step is parameterized as:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right),$$

where typically $\Sigma_\theta(x_t, t) = \sigma_t^2 I$ and

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right), \qquad \alpha_t = 1-\beta_t,$$

with an objective:

$$\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\|^2\right].$$
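A minimal sketch of this forward process and the $\epsilon$-prediction objective, assuming a linear $\beta_t$ schedule and a generic denoiser callable `eps_model` (both illustrative choices, not UltraShape's actual configuration):

```python
# Minimal DDPM sketch: closed-form forward noising and epsilon-prediction loss.
# The linear beta schedule and the `eps_model` interface are illustrative assumptions.
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)             # assumed linear beta_t schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, eps):
    """Closed form: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = alphas_bar.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

def ddpm_loss(eps_model, x0):
    """Epsilon-prediction objective E || eps - eps_theta(x_t, t) ||^2."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)
    return torch.mean((eps - eps_model(x_t, t)) ** 2)
```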
2.2 Stage 1: Coarse Geometry Generation
Coarse structure synthesis operates on VAE latent tokens (4096–10 240 tokens) using a DiT transformer (backbone: Hunyuan3D-2.1). Conditioning occurs via DINOv2 cross-attention to one or more input images, resulting in a low-frequency SDF field on a regular grid. No additional training is performed beyond the pretrained backbone.
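A sketch of how DINOv2 image features could be injected into the latent tokens via cross-attention; the module name, layer sizes, and single-block structure are assumptions for illustration, not the actual Hunyuan3D-2.1 design:

```python
# Illustrative cross-attention block: shape tokens attend to DINOv2 image tokens.
# Dimensions and the residual structure are assumptions, not the actual backbone.
import torch.nn as nn

class ImageCrossAttention(nn.Module):
    def __init__(self, dim=1024, n_heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, shape_tokens, image_tokens):
        # shape_tokens: (B, N, dim) latent tokens; image_tokens: (B, M, dim) DINOv2 features
        out, _ = self.attn(self.norm(shape_tokens), image_tokens, image_tokens)
        return shape_tokens + out   # residual update of the shape tokens
```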
2.3 Stage 2: Voxel-Conditioned Geometric Refinement
Refinement is performed on a fixed set of voxel queries extracted from the coarse SDF. For each voxel query, both the coarse SDF value and image-alignment cues are obtained. These are treated as initial latent tokens and diffused forward, with the denoising network learning local corrections to the coarse geometry.
Decoupling global positioning in this way permits the model to focus exclusively on synthesizing fine, high-frequency detail within the local spatial context.
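A rough sketch of how fixed voxel queries and their initial SDF cues might be gathered from the coarse output; the narrow-band threshold and coordinate normalization are illustrative assumptions:

```python
# Illustrative extraction of fixed voxel queries near the coarse zero isosurface;
# the band width and [-1, 1] normalization are assumed, not taken from the paper.
import torch

def extract_voxel_queries(coarse_sdf, band=1.5):
    """coarse_sdf: (R, R, R) signed distances measured in voxel units."""
    R = coarse_sdf.shape[0]
    near_surface = coarse_sdf.abs() < band              # narrow band around the surface
    idx = near_surface.nonzero(as_tuple=False)          # (N, 3) integer voxel coordinates
    coords = idx.float() / (R - 1) * 2.0 - 1.0          # normalize positions to [-1, 1]^3
    sdf_vals = coarse_sdf[near_surface].unsqueeze(-1)   # coarse SDF value per query
    return coords, sdf_vals                             # spatial anchors + initial latent cues
```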
3. Data Processing Pipeline
3.1 Watertight Processing: Sparse-Voxel Watershed
Meshes are voxelized into high-resolution, CUDA-accelerated sparse grids. The hole-closing procedure identifies gaps and cracks via flood-fill and fills them automatically. Open-surface regions lacking enclosed volume are thickened volumetrically prior to SDF computation. The resulting signed distance field is computed in sparse form, and a clean isosurface is extracted using dual marching cubes. Unlike prior visibility-based or pure flood-fill methods, this pipeline closes large holes without inducing noisy surface artifacts.
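The flood-fill labeling behind the hole-closing step can be illustrated on a dense occupancy grid; this NumPy/SciPy version is a simplified stand-in for the sparse CUDA implementation described above:

```python
# Rough sketch of flood-fill inside/outside labeling on a dense occupancy grid:
# free voxels reachable from the grid boundary are "outside"; all other free
# voxels are treated as enclosed. Dense arrays are used here for clarity only.
import numpy as np
from scipy import ndimage

def flood_fill_inside(occupancy):
    """occupancy: (R, R, R) bool array, True where a surface voxel exists."""
    free = ~occupancy
    labels, _ = ndimage.label(free)                      # connected components of free space
    # Components touching the grid boundary are exterior air.
    boundary_labels = np.unique(np.concatenate([
        labels[0].ravel(), labels[-1].ravel(),
        labels[:, 0].ravel(), labels[:, -1].ravel(),
        labels[:, :, 0].ravel(), labels[:, :, -1].ravel(),
    ]))
    outside = np.isin(labels, boundary_labels) & free
    inside = free & ~outside                             # enclosed cavities / object interior
    return inside | occupancy                            # solid mask for SDF sign assignment
```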
3.2 High-Quality Data Filtering
Beginning with approximately 800 K models from Objaverse, three filtering stages are performed:
- Vision-Language Model (VLM)-Based Filtering: Multi-view depth and normal renders are analyzed by VLMs to filter out primitives, ground planes, and noisy scans.
- Pose Normalization: A learned canonicalization network, with VLM validation, realigns models to a consistent orientation.
- Geometry Filtering: Interior-versus-exterior point ratios identify thin shells (a minimal sketch of this check appears below); fragmented shapes are excluded via VAE reconstruction quality.
After spot-checking, this process yields roughly 330 K valid meshes (120 K high quality), forming the training dataset.
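The interior-versus-exterior point-ratio check from the geometry-filtering stage can be sketched as follows; the sampling count, rejection threshold, and use of trimesh for containment queries are illustrative choices:

```python
# Illustrative thin-shell check: sample points in the bounding box and measure
# the fraction that fall inside the mesh. Threshold and sample count are assumed.
import numpy as np
import trimesh

def is_thin_shell(mesh: trimesh.Trimesh, n_samples=20000, min_interior_ratio=0.01):
    lo, hi = mesh.bounds                                  # axis-aligned bounding box
    points = np.random.uniform(lo, hi, size=(n_samples, 3))
    inside = mesh.contains(points)                        # requires a (near-)watertight mesh
    return inside.mean() < min_interior_ratio             # tiny interior fraction => thin shell
```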
4. Spatial Localization and Detail Synthesis
4.1 Decoupling Spatial Localization
Sampling voxel queries on a fixed regular grid provides explicit spatial anchoring, transforming each token's learning objective from global positioning to precise local SDF correction. This restricts the solution space and enables targeted refinement.
4.2 Rotary Positional Encoding (RoPE) for 3D Localization
RoPE is employed, following Su et al.'s RoFormer, to encode voxel coordinates in the self-attention layers. For a voxel query at coordinates $(p_x, p_y, p_z)$ and per-axis embedding dimension $d$:
- Define $\theta_i = 10000^{-2i/d}$ for $i = 0, \dots, d/2 - 1$.
- For each 2-dimensional subvector $(x_{2i}, x_{2i+1})$ of the per-axis features, apply the rotation

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(p\,\theta_i) & -\sin(p\,\theta_i) \\ \sin(p\,\theta_i) & \cos(p\,\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},$$

with $p$ the voxel coordinate along the corresponding axis, applied consistently across timesteps and layers. Concatenated across axes, the result is used in attention layers for position-dependent refinement.
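A compact sketch of this per-axis rotary encoding applied to 3D voxel coordinates; splitting the feature dimension into three equal per-axis chunks is an assumption for illustration:

```python
# Illustrative 3D rotary position encoding: each axis rotates its own chunk of the
# feature vector by angles p * theta_i, then the chunks are concatenated.
import torch

def rope_3d(x, coords, base=10000.0):
    """x: (N, D) query/key features; coords: (N, 3) voxel coordinates; D divisible by 6."""
    N, D = x.shape
    d = D // 3                                            # per-axis embedding dimension
    coords = coords.to(x.dtype)
    out = []
    for a in range(3):
        xa = x[:, a * d:(a + 1) * d]
        half = d // 2
        theta = base ** (-2.0 * torch.arange(half, dtype=x.dtype) / d)  # theta_i
        angles = coords[:, a:a + 1] * theta               # (N, d/2) angles p * theta_i
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = xa[:, 0::2], xa[:, 1::2]                 # pairs (x_{2i}, x_{2i+1})
        rot = torch.stack([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], dim=-1).reshape(N, d)
        out.append(rot)
    return torch.cat(out, dim=-1)
```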
5. Training Protocols and Implementation Specifics
The framework is trained on 330 K curated meshes (120 K high quality) from Objaverse, with each object rendered from 16 RGB views. Training is performed on 8 NVIDIA H20 GPUs (batch size 32).
- Stage 1: Hunyuan3D-2.1 VAE and DiT backbones are not retrained.
- Stage 2:
  - VAE fine-tuning: 55 K steps, with uniform perturbation of the query positions and an increased token count (4096 → 8192).
  - DiT fine-tuning: image-conditioned, following a progressive schedule:
    - Stage A: 4096 tokens, 10 K steps
    - Stage B: 8192 tokens, 15 K steps
    - Stage C: 10 240 tokens, 60 K steps
- Inference: up to 32 768 tokens, with background masking.
Losses consist of the diffusion $\epsilon$-prediction objective and a VAE reconstruction loss sampled at 1.6 million supervision points. AdamW is used for optimization, with a learning rate of 1e-4 and cosine decay.
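The stated optimizer and schedule map directly onto standard PyTorch components; the total step count below is a placeholder, not a value from the paper:

```python
# AdamW with learning rate 1e-4 and cosine decay, as stated above.
import torch

def make_optimizer(model, total_steps=60_000):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)
    return opt, sched
```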
6. Evaluation: Quantitative and Qualitative Results
6.1 Quantitative Metrics
Performance is assessed via Chamfer Distance (CD) and Intersection-over-Union (IoU) scores. UltraShape 1.0 demonstrates 20–30% lower CD and 5–10% higher IoU compared to open-source baselines such as CLAY, TripoSG, and Sparc3D under matched rendering conditions.
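For reference, both metrics can be computed from sampled surface points and occupancy grids as sketched below; point counts and grid resolution are evaluation choices not specified here:

```python
# Reference-style Chamfer Distance (on sampled surface points) and voxel IoU
# (on boolean occupancy grids). Sampling density and resolution are assumed.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(p, q):
    """p: (N, 3), q: (M, 3) point clouds sampled from the two surfaces."""
    d_pq = cKDTree(q).query(p)[0]        # nearest-neighbor distances p -> q
    d_qp = cKDTree(p).query(q)[0]        # nearest-neighbor distances q -> p
    return (d_pq ** 2).mean() + (d_qp ** 2).mean()

def voxel_iou(occ_a, occ_b):
    """occ_a, occ_b: boolean occupancy grids of identical shape."""
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return inter / max(union, 1)
```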
6.2 Qualitative Comparisons
Empirically, outputs exhibit sharper edges, stronger fine detail (notably in features such as chair spokes and gear teeth), and superior image alignment relative to open-source DiT/vector-set approaches (Hunyuan3D, CLAY, FlashVDM), sparse-voxel approaches (TripoSF, TRELLIS.2), and proprietary systems (DreamFusion, ProlificDreamer).
6.3 Scalability
VAE reconstruction fidelity improves monotonically as the token count increases (4096 → 32 768), and DiT refinement approaches parity in geometric detail at inference when training budgets are doubled. This suggests stable scalability in both stages.
7. Limitations and Prospective Directions
UltraShape 1.0’s reliance on watertight preprocessing may be challenged by meshes with intricate internal cavities or ultrathin, perforated shells. Image conditioning is sensitive to segmentation errors and background artifacts, indicating the need for enhanced 2D/3D segmentation and foreground isolation. Computational requirements scale with token count, making two-stage inference costly for ultra-high-resolution outputs.
Potential future directions include integration of text or multimodal conditioning, adaptive voxel grids (octree, hash-grid) for enhanced local detail synthesis, end-to-end joint stage training, and leveraging simulators or CAD constraints for functional geometry generation (Jia et al., 24 Dec 2025).