UniPart: Unified 3D & Simulation Frameworks

Updated 17 December 2025
  • UniPart is a family of frameworks for unified part-level modeling, enabling explicit control in 3D object generation and multiphysics simulations.
  • It employs dual-stage latent diffusion and a unified encoder-decoder architecture to integrate geometric and segmentation cues without external models.
  • Its multiphysics extension uses Partition of Unity and peridynamic enrichment to achieve high accuracy in fracture mechanics and effective local-global coupling.

UniPart is a family of frameworks for unified part-level modeling and synthesis in both 3D object generation and multiphysics simulation. Two prominent instantiations are documented: (1) a variational–peridynamic enrichment of the Partition of Unity Method (PUM) for fracture mechanics (“UniPart” in numerical PDEs) (Birner et al., 2021) and (2) UniPart for image-guided decomposable 3D shape synthesis via a unified geometry–segmentation latent (“Geom-Seg VecSet”) and dual-space latent diffusion (He et al., 10 Dec 2025). Both embed part-awareness directly into the representation and computation, enabling explicit localized control (fracture in simulation; part specification and alignment in synthesis) without relying on external segmentation models or expensive fully global solvers.

1. Unified Geom-Seg Latent Representation (Geom-Seg VecSet)

UniPart for part-level 3D generation is built on top of the VecSet VAE architecture, augmenting it to produce a latent code in which each vector encodes both geometric and part-label information. For a surface mesh $\mathcal{O}$, $C$ sampled points $p_j \in \mathbb{R}^3$ are annotated with normals $n_j \in \mathbb{R}^3$ and part IDs $\ell_j \in \{1,\ldots,N\}$, forming the input $P \in \mathbb{R}^{C \times 7}$. The encoder $\mathcal{E}: P \mapsto Z \in \mathbb{R}^{L \times d}$ employs a cross-attention block between the input points and $L$ learnable queries, followed by several self-attention layers.
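
A minimal sketch of such a point-set-to-latent encoder, assuming PyTorch; the layer sizes (`num_latents`, `d_model`) and the use of standard `nn.MultiheadAttention`/`nn.TransformerEncoderLayer` blocks are illustrative choices rather than the exact VecSet implementation.

```python
import torch
import torch.nn as nn

class GeomSegEncoder(nn.Module):
    """Sketch of a VecSet-style encoder: C surface points (xyz, normal, part-ID)
    are compressed into L latent vectors via cross-attention to learnable queries."""
    def __init__(self, num_latents=512, d_model=64, num_heads=8, num_self_layers=4):
        super().__init__()
        self.point_embed = nn.Linear(7, d_model)          # (x, y, z, n_x, n_y, n_z, part_id)
        self.queries = nn.Parameter(torch.randn(num_latents, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.self_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, num_heads, dim_feedforward=4 * d_model,
                                       batch_first=True)
            for _ in range(num_self_layers)
        )

    def forward(self, points):                            # points: (B, C, 7)
        feats = self.point_embed(points)                   # (B, C, d)
        q = self.queries.unsqueeze(0).expand(points.size(0), -1, -1)
        z, _ = self.cross_attn(q, feats, feats)            # latent set Z: (B, L, d)
        for layer in self.self_layers:
            z = layer(z)
        return z

# Example: 2048 sampled surface points -> 512 latent vectors of width 64
encoder = GeomSegEncoder()
Z = encoder(torch.randn(2, 2048, 7))                      # Z.shape == (2, 512, 64)
```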

Decoders are structured as:

  • Geometry decoder $\mathcal{D}_{\mathrm{geom}}(Z, q)$, predicting SDF or occupancy at a query point $q$.
  • Segmentation decoder $\mathcal{D}_{\mathrm{seg}}$ (promptable, following [ravi2024sam2]), mapping latent queries with respect to part prompts to segmentation masks $M \in \{0,1\}^{L \times N}$.

Training minimizes the composite loss:

$$\mathcal{L}_{\mathrm{vecset}} = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{seg}} + \lambda_{\mathrm{kl}}\,\mathcal{L}_{\mathrm{kl}}$$

where

  • $\mathcal{L}_{\mathrm{recon}} = \mathbb{E}_q \| \mathcal{D}_{\mathrm{geom}}(Z, q) - f(q) \|^2$, with $f(q)$ the ground-truth field value at the query point,
  • $\mathcal{L}_{\mathrm{seg}}$ is the cross-entropy between predicted and ground-truth masks,
  • $\mathcal{L}_{\mathrm{kl}}$ is the latent KL divergence. Fine-tuning adds only the segmentation decoder; the latent capacity ($L$, $d$) remains unchanged, preserving geometric quality while enriching the latent with part structure (He et al., 10 Dec 2025).
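
The composite objective can be assembled directly from these terms; the sketch below (PyTorch) treats the predicted occupancy values, per-latent part logits, and Gaussian latent parameters as given tensors, so the tensor names and shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def vecset_loss(pred_occ, gt_occ, mask_logits, gt_part_ids, mu, logvar, lambda_kl=1e-3):
    """Composite VecSet loss: geometry reconstruction + segmentation + latent KL.

    pred_occ:    (B, Q)      predicted SDF/occupancy at Q query points
    gt_occ:      (B, Q)      ground-truth field values f(q)
    mask_logits: (B, L, N)   per-latent part logits from the segmentation decoder
    gt_part_ids: (B, L)      ground-truth part label of each latent vector
    mu, logvar:  (B, L, d)   parameters of the latent Gaussian for the KL term
    """
    l_recon = F.mse_loss(pred_occ, gt_occ)
    l_seg = F.cross_entropy(mask_logits.flatten(0, 1), gt_part_ids.flatten())
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return l_recon + l_seg + lambda_kl * l_kl
```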

2. Two-Stage Latent Diffusion Framework

The UniPart synthesis pipeline uses a cascade of diffusion transformers (DiT), each stage operating in a structured latent space.

Stage 1 (Whole-object): Joint geometry and part-segmentation diffusion is performed in $\mathbb{R}^{L \times d}$. The forward noising step is $Z_t = (1-t)Z_0 + t\varepsilon$, with $\varepsilon \sim \mathcal{N}(0, I)$. Conditional flow matching [lipman2022flowmatching] is used to train the denoising velocity field $v_\theta(Z_t, t \mid I)$ with the objective

$$\mathcal{L}_{\mathrm{cfm}} = \mathbb{E}_{t, Z_0, \varepsilon} \left\| v_\theta(Z_t, t \mid I) - (\varepsilon - Z_0) \right\|^2$$

where $I$ is an RGB image, encoded and cross-attended to in each DiT block. Classifier-free guidance is enabled by randomly omitting the image features $F_I$ with probability 0.1 during training.
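
A compact sketch of one such training step, assuming PyTorch; `dit` stands for an arbitrary velocity network $v_\theta$, and the condition-dropout mechanism (zeroing the image tokens) is an illustrative choice, not necessarily how the paper realizes the null condition.

```python
import torch

def cfm_training_step(dit, z0, image_tokens, p_uncond=0.1):
    """One conditional flow-matching step: noise Z_0 along Z_t = (1-t) Z_0 + t*eps,
    then regress the velocity target (eps - Z_0).

    dit:          callable v_theta(Z_t, t, F_I) -> (B, L, d), e.g. a DiT
    z0:           (B, L, d) clean whole-object latents
    image_tokens: (B, T, d_img) image features F_I consumed via cross-attention
    """
    b = z0.size(0)
    t = torch.rand(b, device=z0.device).view(b, 1, 1)              # t ~ U(0, 1)
    eps = torch.randn_like(z0)
    z_t = (1 - t) * z0 + t * eps                                    # forward noising
    # Classifier-free guidance: drop the image condition with probability 0.1
    drop = (torch.rand(b, device=z0.device) < p_uncond).view(b, 1, 1)
    cond = torch.where(drop, torch.zeros_like(image_tokens), image_tokens)
    v_pred = dit(z_t, t.view(b), cond)
    return ((v_pred - (eps - z0)) ** 2).mean()
```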

After denoising, the segmentation decoder assigns a part label $m_j$ to each latent, a position head $\mathcal{D}_{\mathrm{pos}}$ infers anchor points, and the latents are grouped into per-part soft clusters $\{X_i\}_{i=1}^N$ by farthest point sampling (FPS) and non-maximum suppression (NMS) over the pairs $(p_j^{\mathrm{latent}}, m_j)$.
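
A rough sketch of how latents could be grouped into per-part clusters from the predicted labels and anchor positions; the FPS subsampling, fixed cluster size, and zero-padding for empty parts are assumptions made for illustration (the paper's anchoring and NMS details may differ).

```python
import torch

def farthest_point_sample(points, k):
    """Greedy FPS: repeatedly pick the point farthest from the already-selected set."""
    n = points.size(0)
    selected = [0]
    dists = torch.full((n,), float("inf"))
    while len(selected) < k:
        last = points[selected[-1]].unsqueeze(0)                      # (1, 3)
        dists = torch.minimum(dists, torch.cdist(points, last).squeeze(1))
        selected.append(int(dists.argmax()))
    return torch.tensor(selected)

def group_latents_by_part(latents, anchors, part_labels, num_parts, cluster_size=64):
    """Build a coarse cluster X_i per part: gather latents whose predicted label m_j is i
    and, if needed, subsample them with FPS on their anchor positions."""
    clusters = []
    for i in range(num_parts):
        idx = (part_labels == i).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            clusters.append(torch.zeros(cluster_size, latents.size(-1)))  # empty-part placeholder
            continue
        if idx.numel() > cluster_size:
            idx = idx[farthest_point_sample(anchors[idx], cluster_size)]
        clusters.append(latents[idx])
    return clusters
```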

Stage 2 (Part-level): Independent latent diffusion is conducted for each part $i$ in dual spaces, the global coordinate space (gcs) and the normalized canonical space (ncs), with input

$$X_i^* = \left(X_i^{\mathrm{gcs}}, X_i^{\mathrm{ncs}}\right) \in \mathbb{R}^{L \times 2d}$$

and conditions (i) the image, (ii) the whole-object latent $\hat Z$, and (iii) the coarse part cluster $X_i$. Space-specific embeddings $e^s$ are added to the tokens before the transformer layers.

Attention is interleaved as follows:

  • Local: only within tokens of gcs or ncs subspace.
  • Global: across all part tokens, enforcing cross-space consistency.
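
One way to realize this interleaving over the concatenated gcs and ncs token streams is a block mask for the local step and unrestricted attention for the global step; a PyTorch sketch, with module sizes and the residual wiring as illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualSpaceAttentionBlock(nn.Module):
    """Alternating attention over concatenated [gcs | ncs] part tokens (2L total):
    the local step masks out cross-space pairs, the global step attends everywhere."""
    def __init__(self, d_model=64, num_heads=8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, tokens, num_latents):
        # tokens: (B, 2L, d); the first L tokens are gcs, the last L are ncs
        two_l = tokens.size(1)
        is_ncs = torch.arange(two_l, device=tokens.device) >= num_latents
        cross_space = is_ncs.unsqueeze(0) != is_ncs.unsqueeze(1)      # True = disallowed pair
        x, _ = self.local_attn(tokens, tokens, tokens, attn_mask=cross_space)
        tokens = tokens + x                                           # local (within-space) update
        x, _ = self.global_attn(tokens, tokens, tokens)               # global (cross-space) update
        return tokens + x

# Example: 512 gcs tokens + 512 ncs tokens of width 64
block = DualSpaceAttentionBlock()
out = block(torch.randn(2, 1024, 64), num_latents=512)
```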

3. Dual-Space Generation and Assembled Placement

Decoding both global and canonical latent views for every part yields

  • $M_i^{\mathrm{gcs}} = \mathcal{D}_{\mathrm{geom}}(X_i^{\mathrm{gcs}})$
  • $M_i^{\mathrm{ncs}} = \mathcal{D}_{\mathrm{geom}}(X_i^{\mathrm{ncs}})$

The ncs mesh is a canonical $[0,1]^3$ shape; the gcs mesh determines the global pose. A similarity transform $T_i(x) = \operatorname{diag}(\max - \min)\,x + c_i$ is computed from the gcs bounding box and centroid $c_i$. Final part meshes are mapped as $\widetilde{M}_i = T_i(M_i^{\mathrm{ncs}})$ and the full object is $\widehat{\mathcal{O}} = \bigcup_{i=1}^N \widetilde{M}_i$. This dual-space approach mitigates collapse of fine detail in part meshes and ensures precise part placement (He et al., 10 Dec 2025).
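
The placement step reduces to a per-part scale-and-translate; below is a numpy sketch over raw vertex arrays, which assumes the ncs vertices are centered about 0.5 so that the bounding-box center of the gcs mesh can serve as $c_i$ (the exact centering convention in the paper is not spelled out here).

```python
import numpy as np

def assemble_parts(gcs_vertices_list, ncs_vertices_list):
    """Map each canonical (ncs) part into global coordinates via T_i, i.e. rescale by the
    per-axis extent of the gcs bounding box and translate to its center, then union the parts."""
    assembled = []
    for v_gcs, v_ncs in zip(gcs_vertices_list, ncs_vertices_list):
        vmin, vmax = v_gcs.min(axis=0), v_gcs.max(axis=0)
        scale = vmax - vmin                        # diag(max - min) of the gcs bounding box
        center = 0.5 * (vmin + vmax)               # box center used as c_i (assumption)
        v_global = (v_ncs - 0.5) * scale + center  # ncs lives in [0, 1]^3, centered at 0.5
        assembled.append(v_global)
    return assembled

# Example with two hypothetical parts (vertex arrays of shape (V, 3))
parts_gcs = [np.random.rand(100, 3) * 0.3, np.random.rand(80, 3) * 0.5 + 0.4]
parts_ncs = [np.random.rand(100, 3), np.random.rand(80, 3)]
object_vertices = np.concatenate(assemble_parts(parts_gcs, parts_ncs), axis=0)
```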

4. Image Conditioning and Transformer-based Cross-Modal Coupling

In both UniPart diffusion stages, DiT blocks alternately apply:

  • Latent token self-attention,
  • Cross-attention to image tokens $F_I = \mathcal{V}(I)$,
  • Feed-forward layers,
  • (Part stage) Cross-attention to whole-object and part-level latents.

The image encoder $\mathcal{V}$ is a ViT backbone (here matching Hunyuan3D-2.1). Classifier-free guidance is trained for robustness to missing conditions and is leveraged at inference by linear velocity blending. This paradigm effectively utilizes 2D semantic priors during 3D generation, without external segmenters (He et al., 10 Dec 2025).
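
At inference, the linear velocity blending amounts to $v = v_{\mathrm{uncond}} + s\,(v_{\mathrm{cond}} - v_{\mathrm{uncond}})$ integrated along the flow; a minimal Euler-sampler sketch in PyTorch, where the step count, guidance scale $s$, and zeroed null condition are placeholder choices.

```python
import torch

@torch.no_grad()
def sample_with_cfg(dit, image_tokens, shape, steps=50, guidance=4.0):
    """Euler integration of the learned flow from noise (t=1) to data (t=0), blending
    conditional and unconditional velocities at every step."""
    z = torch.randn(shape)                               # Z_1: pure Gaussian noise
    null_cond = torch.zeros_like(image_tokens)           # stand-in for the dropped condition
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = torch.full((shape[0],), i / steps)
        v_cond = dit(z, t, image_tokens)
        v_uncond = dit(z, t, null_cond)
        v = v_uncond + guidance * (v_cond - v_uncond)    # linear velocity blending
        z = z - dt * v                                   # step toward t = 0 (dZ/dt = v)
    return z
```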

5. Quantitative Evaluation and Ablation Studies

On a held-out dataset of 100 shapes, UniPart demonstrates state-of-the-art performance on part-level Chamfer Distance (CD↓) and F-Score (F↑) metrics:

| Method | CD↓ | F₀.₀₅↑ | F₀.₁₀↑ |
|---|---|---|---|
| HoloPart | 0.1492 | 0.5208 | 0.7450 |
| OmniPart | 0.1453 | 0.5273 | 0.7656 |
| PartCrafter | 0.1778 | 0.4749 | 0.7120 |
| PartPacker | 0.1654 | 0.4715 | 0.7226 |
| X-Part | 0.1533 | 0.5242 | 0.7523 |
| UniPart | 0.1311 | 0.5565 | 0.8052 |

Segmentation mIoU over generated objects:

| Method | mIoU↑ |
|---|---|
| SAMesh | 0.3608 |
| PartField | 0.4167 |
| P3-SAM | 0.7046 |
| UniPart | 0.7222 |

Ablation studies reveal that removing the ncs diffusion degrades CD to ≈0.145 and F₀.₀₅ drops by ~4%; omitting local-only attention increases CD to ~0.140; skipping space embeddings results in frequent misassemblies. These findings validate the importance of the staged diffusion, dual-space modeling, and explicit token structuring (He et al., 10 Dec 2025).

6. UniPart in Multiphysics Simulation (PUM–Peridynamic Enrichment)

A structurally separate framework under the UniPart umbrella addresses variational simulation of fracture via multiscale Partition of Unity Methods:

  • The computational domain $\Omega \subset \mathbb{R}^d$ is covered by overlapping patches $\{\omega_i\}$ with partition functions $\varphi_i(x)$, $\sum_i \varphi_i(x) = 1$.
  • The global trial space is $V^{\mathrm{PU}} = \sum_i \varphi_i(x)\,V_i$ with $V_i = P_i \oplus E_i$, a local polynomial basis plus optional enrichment (Heaviside, Westergaard).
  • Fracture-prone subdomains $\Omega^{\mathrm{loc}}$ are modeled by peridynamics (PD), with strong form:

$$\rho(x)\,\ddot{u}(x,t) = \int_{B_\delta(x)} f\big(u(x',t) - u(x,t),\, x',\, x\big)\, dV_{x'} + b(x,t)$$

  • A global–local enrichment algorithm solves for $u^{\mathrm{PU}}$ over $\Omega$, hands off subdomain data to PD for local fracture evolution, and reinjects the fine-scale crack response and geometry as real-time enrichment functions.
  • This approach enables a unified variational framework accommodating linear elasticity (PU), nonlocal PD fracture, and seamless patch-wise up-/downscaling of solution accuracy (Birner et al., 2021).
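
To make the strong form concrete, here is a minimal 1D discretization of the bond-based peridynamic internal force; the grid, horizon, and constant micromodulus `c` are hypothetical, and the actual UniPart/PUM coupling layers the variational enrichment described above on top of such a local PD solve.

```python
import numpy as np

def pd_internal_force(u, x, horizon, c):
    """Discrete bond-based peridynamic force density in 1D: for each node x_i, integrate
    the pairwise force f(u(x') - u(x), x', x) over the neighborhood B_delta(x_i)."""
    n = len(x)
    dx = x[1] - x[0]                                  # uniform grid spacing (volume per node)
    force = np.zeros(n)
    for i in range(n):
        for j in range(n):
            xi = x[j] - x[i]                          # reference bond vector
            if i == j or abs(xi) > horizon:
                continue
            eta = u[j] - u[i]                         # relative displacement u(x') - u(x)
            stretch = (abs(xi + eta) - abs(xi)) / abs(xi)
            force[i] += c * stretch * np.sign(xi + eta) * dx    # f(...) dV_{x'}
    return force

# Example: a uniformly stretched 1D bar; interior forces cancel, only end effects remain
x = np.linspace(0.0, 1.0, 101)
u = 1e-3 * x                                          # 0.1% uniform stretch
f = pd_internal_force(u, x, horizon=0.03, c=1.0)
```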

7. Theoretical Analysis, Performance, and Implications

UniPart’s unified latent and solver representations yield key technical benefits:

  • The Geom-Seg VecSet demonstrates that joint geometric and segmentation encoding can be trained with no loss in geometric quality, enabling explicit part control and cross-modal conditioning without large annotated part segmentation datasets.
  • Dual-space and hierarchical diffusion strategies result in high-fidelity generation and superior part-level correspondence not achievable with single-stage or global-only methods.
  • In multiphysics simulation, global–local enrichment via PU and PD achieves optimal error rates for polynomials of degree $p$:
    • $\|u - u_h\|_{L^2} = O(h^{p+1})$, $\|u - u_h\|_{H^1} = O(h^p)$, including near singularities (e.g., cracks).
  • Numerical studies confirm that the PU and PD models match closely (maximum displacement error < 3%, up to $10^{-7}$ to $10^{-8}$ m for stationary cracks), with local PD computations offering drastic reductions in compute time relative to global PD (Birner et al., 2021).

This suggests that UniPart paradigms—whether for generative modeling or physical simulation—provide a scalable, explicit, and interpretable methodology for decomposable, part-aware computation and synthesis, eliminating reliance on monolithic black-box systems or expensive end-to-end training for segmentation and local detail.
