UltraShape 1.0: 3D Diffusion for Mesh Synthesis
- UltraShape 1.0 is a scalable 3D diffusion framework that generates high-fidelity, watertight meshes using a two-stage process.
- It decouples global structure synthesis from local geometric refinement, ensuring topological robustness and fine-detail accuracy.
- The method, trained on curated public datasets, achieves competitive performance, with lower Chamfer Distance and higher IoU than open-source baselines, targeting real-world applications.
UltraShape 1.0 is a scalable 3D diffusion framework designed for high-fidelity geometry generation. It achieves competitive quantitative and qualitative performance in mesh synthesis by combining rigorous watertight data curation with a two-stage generation pipeline: an initial coarse structure synthesis followed by fine-grained geometric refinement. UltraShape 1.0 is trained exclusively on publicly available datasets, and its architecture decouples spatial localization from local surface synthesis, enabling scalable, detail-preserving outputs suitable for downstream applications such as physics simulation, CAD, and AR/VR (Jia et al., 24 Dec 2025).
1. Motivation and Architectural Overview
UltraShape 1.0 addresses three principal challenges in 3D generative modeling: limited data availability and irregularity, topological robustness, and scalability to high resolutions. Publicly available 3D meshes are often scarce and noisy, containing cracks, self-intersections, and non-watertight regions that inhibit downstream usability. Achieving topological robustness, specifically watertightness, is critical for simulation, manufacturing, and visualization use cases. Prior remeshing approaches, including those based on unsigned distance functions (UDFs), visibility checks, or flood-fill, frequently introduce artifacts or fail to handle intricate thin regions.
UltraShape 1.0 employs a rigorously designed data pipeline to ensure watertight geometric validity and uses a two-stage neural architecture to decouple global structure estimation from local detail synthesis. The first stage utilizes a vector-set VAE representation and a DiT-style diffusion transformer to capture overall shape, generating a low-frequency signed distance field (SDF). The second stage focuses on refining local geometry at fixed voxel queries, using rotary positional encoding (RoPE) to inject spatial awareness and facilitate the generation of high-frequency geometric detail.
2. Two-Stage Diffusion Pipeline and Mathematical Formulation
2.1 Diffusion Model Fundamentals
UltraShape 1.0 adopts the denoising diffusion probabilistic model (DDPM) framework. Let $x_0$ denote the initial data, which may be either global vector tokens or voxel latent tokens. The forward process adds Gaussian noise incrementally:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),$$

with the closed form:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right), \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).$$

The reverse denoising step is parameterized as:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right),$$

where typically $\Sigma_\theta(x_t, t) = \sigma_t^2 I$ and

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right), \qquad \alpha_t = 1-\beta_t,$$

with an objective:

$$\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\|^2\right].$$
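A minimal sketch of this forward process and the $\epsilon$-prediction objective, assuming a linear $\beta_t$ schedule and a generic denoiser callable `eps_model` (both illustrative choices, not UltraShape's actual configuration):

```python
# Minimal DDPM sketch: closed-form forward noising and epsilon-prediction loss.
# The linear beta schedule and the `eps_model` interface are illustrative assumptions.
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)             # assumed linear beta_t schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, eps):
    """Closed form: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = alphas_bar.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

def ddpm_loss(eps_model, x0):
    """Epsilon-prediction objective E || eps - eps_theta(x_t, t) ||^2."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)
    return torch.mean((eps - eps_model(x_t, t)) ** 2)
```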
2.2 Stage 1: Coarse Geometry Generation
Coarse structure synthesis operates on VAE latent tokens (4096–10 240 tokens) using a DiT transformer (backbone: Hunyuan3D-2.1). Conditioning occurs via DINOv2 cross-attention to one or more input images, resulting in a low-frequency SDF field on a regular grid. No additional training is performed beyond the pretrained backbone.
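A sketch of how DINOv2 image features could be injected into the latent tokens via cross-attention; the module name, layer sizes, and single-block structure are assumptions for illustration, not the actual Hunyuan3D-2.1 design:

```python
# Illustrative cross-attention block: shape tokens attend to DINOv2 image tokens.
# Dimensions and the residual structure are assumptions, not the actual backbone.
import torch.nn as nn

class ImageCrossAttention(nn.Module):
    def __init__(self, dim=1024, n_heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, shape_tokens, image_tokens):
        # shape_tokens: (B, N, dim) latent tokens; image_tokens: (B, M, dim) DINOv2 features
        out, _ = self.attn(self.norm(shape_tokens), image_tokens, image_tokens)
        return shape_tokens + out   # residual update of the shape tokens
```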
2.3 Stage 2: Voxel-Conditioned Geometric Refinement
Refinement is performed on a fixed set of voxel queries extracted from the coarse SDF. For each voxel query, both the coarse SDF value and image-alignment cues are obtained. These are treated as initial latent tokens and diffused forward, with the denoising network learning local corrections to the coarse geometry.
Decoupling global positioning in this way permits the model to focus exclusively on synthesizing fine, high-frequency detail within the local spatial context.
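A rough sketch of how fixed voxel queries and their initial SDF cues might be gathered from the coarse output; the narrow-band threshold and coordinate normalization are illustrative assumptions:

```python
# Illustrative extraction of fixed voxel queries near the coarse zero isosurface;
# the band width and [-1, 1] normalization are assumed, not taken from the paper.
import torch

def extract_voxel_queries(coarse_sdf, band=1.5):
    """coarse_sdf: (R, R, R) signed distances measured in voxel units."""
    R = coarse_sdf.shape[0]
    near_surface = coarse_sdf.abs() < band              # narrow band around the surface
    idx = near_surface.nonzero(as_tuple=False)          # (N, 3) integer voxel coordinates
    coords = idx.float() / (R - 1) * 2.0 - 1.0          # normalize positions to [-1, 1]^3
    sdf_vals = coarse_sdf[near_surface].unsqueeze(-1)   # coarse SDF value per query
    return coords, sdf_vals                             # spatial anchors + initial latent cues
```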
3. Data Processing Pipeline
3.1 Watertight Processing: Sparse-Voxel Watershed
Meshes are voxelized into high-resolution, CUDA-accelerated sparse grids. The hole-closing procedure identifies gaps and cracks via flood-fill and fills them automatically. Open-surface regions lacking enclosed volume are thickened volumetrically prior to SDF computation. The resulting signed distance field is computed in sparse form, and a clean isosurface is extracted using dual marching cubes. Unlike prior visibility-based or pure flood-fill methods, this pipeline closes large holes without inducing noisy surface artifacts.
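The flood-fill labeling behind the hole-closing step can be illustrated on a dense occupancy grid; this NumPy/SciPy version is a simplified stand-in for the sparse CUDA implementation described above:

```python
# Rough sketch of flood-fill inside/outside labeling on a dense occupancy grid:
# free voxels reachable from the grid boundary are "outside"; all other free
# voxels are treated as enclosed. Dense arrays are used here for clarity only.
import numpy as np
from scipy import ndimage

def flood_fill_inside(occupancy):
    """occupancy: (R, R, R) bool array, True where a surface voxel exists."""
    free = ~occupancy
    labels, _ = ndimage.label(free)                      # connected components of free space
    # Components touching the grid boundary are exterior air.
    boundary_labels = np.unique(np.concatenate([
        labels[0].ravel(), labels[-1].ravel(),
        labels[:, 0].ravel(), labels[:, -1].ravel(),
        labels[:, :, 0].ravel(), labels[:, :, -1].ravel(),
    ]))
    outside = np.isin(labels, boundary_labels) & free
    inside = free & ~outside                             # enclosed cavities / object interior
    return inside | occupancy                            # solid mask for SDF sign assignment
```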
3.2 High-Quality Data Filtering
Beginning with approximately 800 K models from Objaverse, three filtering stages are performed:
- Vision-Language Model (VLM)-Based Filtering: Multi-view depth and normal renders are analyzed by VLMs to filter out primitives, ground planes, and noisy scans.
- Pose Normalization: A learned canonicalization network, with VLM validation, realigns models to a consistent orientation.
- Geometry Filtering: Interior-versus-exterior point ratios identify thin shells (a minimal sketch of this check appears below); fragmented shapes are excluded via VAE reconstruction quality.
After spot-checking, this process yields roughly 330 K valid meshes (120 K high quality), forming the training dataset.
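The interior-versus-exterior point-ratio check from the geometry-filtering stage can be sketched as follows; the sampling count, rejection threshold, and use of trimesh for containment queries are illustrative choices:

```python
# Illustrative thin-shell check: sample points in the bounding box and measure
# the fraction that fall inside the mesh. Threshold and sample count are assumed.
import numpy as np
import trimesh

def is_thin_shell(mesh: trimesh.Trimesh, n_samples=20000, min_interior_ratio=0.01):
    lo, hi = mesh.bounds                                  # axis-aligned bounding box
    points = np.random.uniform(lo, hi, size=(n_samples, 3))
    inside = mesh.contains(points)                        # requires a (near-)watertight mesh
    return inside.mean() < min_interior_ratio             # tiny interior fraction => thin shell
```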
4. Spatial Localization and Detail Synthesis
4.1 Decoupling Spatial Localization
Sampling voxel queries on a fixed regular grid provides explicit spatial anchoring, transforming each token's learning objective from global positioning to precise local SDF correction. This restricts the solution space and enables targeted refinement.
4.2 Rotary Positional Encoding (RoPE) for 3D Localization
RoPE is employed, following Su et al.'s RoFormer, to encode voxel coordinates in the self-attention layers. For a voxel query at coordinates $(p_x, p_y, p_z)$ and per-axis embedding dimension $d$:
- Define $\theta_i = 10000^{-2i/d}$ for $i = 0, \dots, d/2 - 1$.
- For each 2-dimensional subvector $(x_{2i}, x_{2i+1})$ of the per-axis features, apply the rotation

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(p\,\theta_i) & -\sin(p\,\theta_i) \\ \sin(p\,\theta_i) & \cos(p\,\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},$$

with $p$ the voxel coordinate along the corresponding axis, applied consistently across timesteps and layers. Concatenated across axes, the result is used in attention layers for position-dependent refinement.
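A compact sketch of this per-axis rotary encoding applied to 3D voxel coordinates; splitting the feature dimension into three equal per-axis chunks is an assumption for illustration:

```python
# Illustrative 3D rotary position encoding: each axis rotates its own chunk of the
# feature vector by angles p * theta_i, then the chunks are concatenated.
import torch

def rope_3d(x, coords, base=10000.0):
    """x: (N, D) query/key features; coords: (N, 3) voxel coordinates; D divisible by 6."""
    N, D = x.shape
    d = D // 3                                            # per-axis embedding dimension
    coords = coords.to(x.dtype)
    out = []
    for a in range(3):
        xa = x[:, a * d:(a + 1) * d]
        half = d // 2
        theta = base ** (-2.0 * torch.arange(half, dtype=x.dtype) / d)  # theta_i
        angles = coords[:, a:a + 1] * theta               # (N, d/2) angles p * theta_i
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = xa[:, 0::2], xa[:, 1::2]                 # pairs (x_{2i}, x_{2i+1})
        rot = torch.stack([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], dim=-1).reshape(N, d)
        out.append(rot)
    return torch.cat(out, dim=-1)
```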
5. Training Protocols and Implementation Specifics
The framework is trained on 330 K curated meshes (120 K high quality) from Objaverse, with each object rendered from 16 RGB views. Training is performed on 8 NVIDIA H20 GPUs (batch size 32).
- Stage 1: Hunyuan3D-2.1 VAE and DiT backbones are not retrained.
- Stage 2:
  - VAE fine-tuning: 55 K steps, with uniform perturbation of the query positions and an increased token count (4096 → 8192).
  - DiT fine-tuning: image-conditioned, following a progressive schedule:
    - Stage A: 4096 tokens, 10 K steps
    - Stage B: 8192 tokens, 15 K steps
    - Stage C: 10 240 tokens, 60 K steps
- Inference: up to 32 768 tokens, with background masking.
Losses consist of the diffusion $\epsilon$-prediction objective and a VAE reconstruction loss sampled at 1.6 million supervision points. AdamW is used for optimization, with a learning rate of 1e-4 and cosine decay.
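The stated optimizer and schedule map directly onto standard PyTorch components; the total step count below is a placeholder, not a value from the paper:

```python
# AdamW with learning rate 1e-4 and cosine decay, as stated above.
import torch

def make_optimizer(model, total_steps=60_000):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)
    return opt, sched
```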
6. Evaluation: Quantitative and Qualitative Results
6.1 Quantitative Metrics
Performance is assessed via Chamfer Distance (CD) and Intersection-over-Union (IoU) scores. UltraShape 1.0 demonstrates 20–30% lower CD and 5–10% higher IoU compared to open-source baselines such as CLAY, TripoSG, and Sparc3D under matched rendering conditions.
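For reference, both metrics can be computed from sampled surface points and occupancy grids as sketched below; point counts and grid resolution are evaluation choices not specified here:

```python
# Reference-style Chamfer Distance (on sampled surface points) and voxel IoU
# (on boolean occupancy grids). Sampling density and resolution are assumed.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(p, q):
    """p: (N, 3), q: (M, 3) point clouds sampled from the two surfaces."""
    d_pq = cKDTree(q).query(p)[0]        # nearest-neighbor distances p -> q
    d_qp = cKDTree(p).query(q)[0]        # nearest-neighbor distances q -> p
    return (d_pq ** 2).mean() + (d_qp ** 2).mean()

def voxel_iou(occ_a, occ_b):
    """occ_a, occ_b: boolean occupancy grids of identical shape."""
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return inter / max(union, 1)
```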
6.2 Qualitative Comparisons
Empirically, outputs exhibit sharper edges, stronger fine detail (notably in features such as chair spokes and gear teeth), and superior image alignment relative to open-source DiT/vector-set approaches (Hunyuan3D, CLAY, FlashVDM), sparse-voxel approaches (TripoSF, TRELLIS.2), and proprietary systems (DreamFusion, ProlificDreamer).
6.3 Scalability
VAE reconstruction fidelity improves monotonically as the token count increases (4096 → 32 768), and DiT refinement approaches parity in geometric detail at inference when training budgets are doubled. This suggests stable scalability in both stages.
7. Limitations and Prospective Directions
UltraShape 1.0’s reliance on watertight preprocessing may be challenged by meshes with intricate internal cavities or ultrathin, perforated shells. Image conditioning is sensitive to segmentation errors and background artifacts, indicating the need for enhanced 2D/3D segmentation and foreground isolation. Computational requirements scale with token count, making two-stage inference costly for ultra-high-resolution outputs.
Potential future directions include integration of text or multimodal conditioning, adaptive voxel grids (octree, hash-grid) for enhanced local detail synthesis, end-to-end joint stage training, and leveraging simulators or CAD constraints for functional geometry generation (Jia et al., 24 Dec 2025).