UniPart: Unified 3D & Simulation Frameworks

Updated 17 December 2025
  • UniPart is a family of frameworks for unified part-level modeling, enabling explicit control in 3D object generation and multiphysics simulations.
  • It employs dual-stage latent diffusion and a unified encoder-decoder architecture to integrate geometric and segmentation cues without external models.
  • Its multiphysics extension uses Partition of Unity and peridynamic enrichment to achieve high accuracy in fracture mechanics and effective local-global coupling.

UniPart is a family of frameworks for unified part-level modeling and synthesis in both 3D object generation and multiphysics simulation. Two prominent instantiations are documented: (1) a variational–peridynamic enrichment of the Partition of Unity Method (PUM) for fracture mechanics (“UniPart” in numerical PDEs) (Birner et al., 2021) and (2) UniPart for image-guided decomposable 3D shape synthesis via a unified geometry–segmentation latent (“Geom-Seg VecSet”) and dual-space latent diffusion (He et al., 10 Dec 2025). Both embed part-awareness directly into the representation and computation, enabling explicit localized control (fracture in simulation; part specification and alignment in synthesis) without relying on external segmentation models or expensive fully global solvers.

1. Unified Geom-Seg Latent Representation (Geom-Seg VecSet)

UniPart for part-level 3D generation is built on top of the VecSet VAE architecture, augmenting it to produce a latent code in which each vector encodes both geometric and part-label information. For a surface mesh $\mathcal{O}$, $C$ sampled points $p_j \in \mathbb{R}^3$ are annotated with normals $n_j \in \mathbb{R}^3$ and part IDs $\ell_j \in \{1,\ldots,N\}$, forming the input $P \in \mathbb{R}^{C \times 7}$. The encoder $\mathcal{E}: P \mapsto Z \in \mathbb{R}^{L \times d}$ employs a cross-attention block between the input points and $L$ learnable queries, followed by several self-attention layers.
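
A minimal sketch of such a point-set-to-latent encoder, assuming PyTorch; the layer sizes (`num_latents`, `d_model`) and the use of standard `nn.MultiheadAttention`/`nn.TransformerEncoderLayer` blocks are illustrative choices rather than the exact VecSet implementation.

```python
import torch
import torch.nn as nn

class GeomSegEncoder(nn.Module):
    """Sketch of a VecSet-style encoder: C surface points (xyz, normal, part-ID)
    are compressed into L latent vectors via cross-attention to learnable queries."""
    def __init__(self, num_latents=512, d_model=64, num_heads=8, num_self_layers=4):
        super().__init__()
        self.point_embed = nn.Linear(7, d_model)          # (x, y, z, n_x, n_y, n_z, part_id)
        self.queries = nn.Parameter(torch.randn(num_latents, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.self_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, num_heads, dim_feedforward=4 * d_model,
                                       batch_first=True)
            for _ in range(num_self_layers)
        )

    def forward(self, points):                            # points: (B, C, 7)
        feats = self.point_embed(points)                   # (B, C, d)
        q = self.queries.unsqueeze(0).expand(points.size(0), -1, -1)
        z, _ = self.cross_attn(q, feats, feats)            # latent set Z: (B, L, d)
        for layer in self.self_layers:
            z = layer(z)
        return z

# Example: 2048 sampled surface points -> 512 latent vectors of width 64
encoder = GeomSegEncoder()
Z = encoder(torch.randn(2, 2048, 7))                      # Z.shape == (2, 512, 64)
```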

Decoders are structured as:

  • Geometry decoder $\mathcal{D}_{\mathrm{geom}}(Z, q)$, predicting SDF or occupancy at a query point $q$.
  • Segmentation decoder $\mathcal{D}_{\mathrm{seg}}$ (promptable, following [ravi2024sam2]), mapping latent queries with respect to part prompts to segmentation masks $M \in \{0,1\}^{L \times N}$.

Training minimizes the composite loss:

$$\mathcal{L}_{\mathrm{vecset}} = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{seg}} + \lambda_{\mathrm{kl}}\,\mathcal{L}_{\mathrm{kl}}$$

where

  • $\mathcal{L}_{\mathrm{recon}} = \mathbb{E}_q \| \mathcal{D}_{\mathrm{geom}}(Z, q) - f(q) \|^2$, with $f(q)$ the ground-truth field value at the query point,
  • $\mathcal{L}_{\mathrm{seg}}$ is the cross-entropy between predicted and ground-truth masks,
  • $\mathcal{L}_{\mathrm{kl}}$ is the latent KL divergence. Fine-tuning adds only the segmentation decoder; the latent capacity ($L$, $d$) remains unchanged, preserving geometric quality while enriching the latent with part structure (He et al., 10 Dec 2025).
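
The composite objective can be assembled directly from these terms; the sketch below (PyTorch) treats the predicted occupancy values, per-latent part logits, and Gaussian latent parameters as given tensors, so the tensor names and shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def vecset_loss(pred_occ, gt_occ, mask_logits, gt_part_ids, mu, logvar, lambda_kl=1e-3):
    """Composite VecSet loss: geometry reconstruction + segmentation + latent KL.

    pred_occ:    (B, Q)      predicted SDF/occupancy at Q query points
    gt_occ:      (B, Q)      ground-truth field values f(q)
    mask_logits: (B, L, N)   per-latent part logits from the segmentation decoder
    gt_part_ids: (B, L)      ground-truth part label of each latent vector
    mu, logvar:  (B, L, d)   parameters of the latent Gaussian for the KL term
    """
    l_recon = F.mse_loss(pred_occ, gt_occ)
    l_seg = F.cross_entropy(mask_logits.flatten(0, 1), gt_part_ids.flatten())
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return l_recon + l_seg + lambda_kl * l_kl
```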

2. Two-Stage Latent Diffusion Framework

The UniPart synthesis pipeline uses a cascade of diffusion transformers (DiT), each stage operating in a structured latent space.

Stage 1 (Whole-object): Joint geometry and part-segmentation diffusion is performed in $\mathbb{R}^{L \times d}$. The forward noising step is $Z_t = (1-t)Z_0 + t\varepsilon$, with $\varepsilon \sim \mathcal{N}(0, I)$. Conditional flow matching [lipman2022flowmatching] is used to train the denoising velocity field $v_\theta(Z_t, t \mid I)$ with the objective

$$\mathcal{L}_{\mathrm{cfm}} = \mathbb{E}_{t, Z_0, \varepsilon} \left\| v_\theta(Z_t, t \mid I) - (\varepsilon - Z_0) \right\|^2$$

where $I$ is an RGB image, encoded and cross-attended to in each DiT block. Classifier-free guidance is enabled by randomly omitting the image features $F_I$ with probability 0.1 during training.
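
A compact sketch of one such training step, assuming PyTorch; `dit` stands for an arbitrary velocity network $v_\theta$, and the condition-dropout mechanism (zeroing the image tokens) is an illustrative choice, not necessarily how the paper realizes the null condition.

```python
import torch

def cfm_training_step(dit, z0, image_tokens, p_uncond=0.1):
    """One conditional flow-matching step: noise Z_0 along Z_t = (1-t) Z_0 + t*eps,
    then regress the velocity target (eps - Z_0).

    dit:          callable v_theta(Z_t, t, F_I) -> (B, L, d), e.g. a DiT
    z0:           (B, L, d) clean whole-object latents
    image_tokens: (B, T, d_img) image features F_I consumed via cross-attention
    """
    b = z0.size(0)
    t = torch.rand(b, device=z0.device).view(b, 1, 1)              # t ~ U(0, 1)
    eps = torch.randn_like(z0)
    z_t = (1 - t) * z0 + t * eps                                    # forward noising
    # Classifier-free guidance: drop the image condition with probability 0.1
    drop = (torch.rand(b, device=z0.device) < p_uncond).view(b, 1, 1)
    cond = torch.where(drop, torch.zeros_like(image_tokens), image_tokens)
    v_pred = dit(z_t, t.view(b), cond)
    return ((v_pred - (eps - z0)) ** 2).mean()
```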

After denoising, the segmentation decoder assigns a part label $m_j$ to each latent, a position head $\mathcal{D}_{\mathrm{pos}}$ infers anchor points, and the latents are grouped into per-part soft clusters $\{X_i\}_{i=1}^N$ by farthest point sampling (FPS) and non-maximum suppression (NMS) over the pairs $(p_j^{\mathrm{latent}}, m_j)$.
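
A rough sketch of how latents could be grouped into per-part clusters from the predicted labels and anchor positions; the FPS subsampling, fixed cluster size, and zero-padding for empty parts are assumptions made for illustration (the paper's anchoring and NMS details may differ).

```python
import torch

def farthest_point_sample(points, k):
    """Greedy FPS: repeatedly pick the point farthest from the already-selected set."""
    n = points.size(0)
    selected = [0]
    dists = torch.full((n,), float("inf"))
    while len(selected) < k:
        last = points[selected[-1]].unsqueeze(0)                      # (1, 3)
        dists = torch.minimum(dists, torch.cdist(points, last).squeeze(1))
        selected.append(int(dists.argmax()))
    return torch.tensor(selected)

def group_latents_by_part(latents, anchors, part_labels, num_parts, cluster_size=64):
    """Build a coarse cluster X_i per part: gather latents whose predicted label m_j is i
    and, if needed, subsample them with FPS on their anchor positions."""
    clusters = []
    for i in range(num_parts):
        idx = (part_labels == i).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            clusters.append(torch.zeros(cluster_size, latents.size(-1)))  # empty-part placeholder
            continue
        if idx.numel() > cluster_size:
            idx = idx[farthest_point_sample(anchors[idx], cluster_size)]
        clusters.append(latents[idx])
    return clusters
```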

Stage 2 (Part-level): Independent latent diffusion is conducted for each part $i$ in dual spaces, the global coordinate space (gcs) and the normalized canonical space (ncs), with input

$$X_i^* = \left(X_i^{\mathrm{gcs}}, X_i^{\mathrm{ncs}}\right) \in \mathbb{R}^{L \times 2d}$$

and conditions (i) the image, (ii) the whole-object latent $\hat Z$, and (iii) the coarse part cluster $X_i$. Space-specific embeddings $e^s$ are added to the tokens before the transformer layers.

Attention is interleaved as follows:

  • Local: only within tokens of gcs or ncs subspace.
  • Global: across all part tokens, enforcing cross-space consistency.
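
One way to realize this interleaving over the concatenated gcs and ncs token streams is a block mask for the local step and unrestricted attention for the global step; a PyTorch sketch, with module sizes and the residual wiring as illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualSpaceAttentionBlock(nn.Module):
    """Alternating attention over concatenated [gcs | ncs] part tokens (2L total):
    the local step masks out cross-space pairs, the global step attends everywhere."""
    def __init__(self, d_model=64, num_heads=8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, tokens, num_latents):
        # tokens: (B, 2L, d); the first L tokens are gcs, the last L are ncs
        two_l = tokens.size(1)
        is_ncs = torch.arange(two_l, device=tokens.device) >= num_latents
        cross_space = is_ncs.unsqueeze(0) != is_ncs.unsqueeze(1)      # True = disallowed pair
        x, _ = self.local_attn(tokens, tokens, tokens, attn_mask=cross_space)
        tokens = tokens + x                                           # local (within-space) update
        x, _ = self.global_attn(tokens, tokens, tokens)               # global (cross-space) update
        return tokens + x

# Example: 512 gcs tokens + 512 ncs tokens of width 64
block = DualSpaceAttentionBlock()
out = block(torch.randn(2, 1024, 64), num_latents=512)
```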

3. Dual-Space Generation and Assembled Placement

Decoding both global and canonical latent views for every part yields

  • $M_i^{\mathrm{gcs}} = \mathcal{D}_{\mathrm{geom}}(X_i^{\mathrm{gcs}})$
  • $M_i^{\mathrm{ncs}} = \mathcal{D}_{\mathrm{geom}}(X_i^{\mathrm{ncs}})$

The ncs mesh is a canonical $[0,1]^3$ shape; the gcs mesh determines the global pose. A similarity transform $T_i(x) = \operatorname{diag}(\max - \min)\,x + c_i$ is computed from the gcs bounding box and centroid $c_i$. Final part meshes are mapped as $\widetilde{M}_i = T_i(M_i^{\mathrm{ncs}})$ and the full object is $\widehat{\mathcal{O}} = \bigcup_{i=1}^N \widetilde{M}_i$. This dual-space approach mitigates collapse of fine detail in part meshes and ensures precise part placement (He et al., 10 Dec 2025).
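
The placement step reduces to a per-part scale-and-translate; below is a numpy sketch over raw vertex arrays, which assumes the ncs vertices are centered about 0.5 so that the bounding-box center of the gcs mesh can serve as $c_i$ (the exact centering convention in the paper is not spelled out here).

```python
import numpy as np

def assemble_parts(gcs_vertices_list, ncs_vertices_list):
    """Map each canonical (ncs) part into global coordinates via T_i, i.e. rescale by the
    per-axis extent of the gcs bounding box and translate to its center, then union the parts."""
    assembled = []
    for v_gcs, v_ncs in zip(gcs_vertices_list, ncs_vertices_list):
        vmin, vmax = v_gcs.min(axis=0), v_gcs.max(axis=0)
        scale = vmax - vmin                        # diag(max - min) of the gcs bounding box
        center = 0.5 * (vmin + vmax)               # box center used as c_i (assumption)
        v_global = (v_ncs - 0.5) * scale + center  # ncs lives in [0, 1]^3, centered at 0.5
        assembled.append(v_global)
    return assembled

# Example with two hypothetical parts (vertex arrays of shape (V, 3))
parts_gcs = [np.random.rand(100, 3) * 0.3, np.random.rand(80, 3) * 0.5 + 0.4]
parts_ncs = [np.random.rand(100, 3), np.random.rand(80, 3)]
object_vertices = np.concatenate(assemble_parts(parts_gcs, parts_ncs), axis=0)
```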

4. Image Conditioning and Transformer-based Cross-Modal Coupling

In both UniPart diffusion stages, DiT blocks alternately apply:

  • Latent token self-attention,
  • Cross-attention to image tokens $F_I = \mathcal{V}(I)$,
  • Feed-forward layers,
  • (Part stage) Cross-attention to whole-object and part-level latents.

The image encoder $\mathcal{V}$ is a ViT backbone (here matching Hunyuan3D-2.1). Classifier-free guidance is trained for robustness to missing conditions and is leveraged at inference by linear velocity blending. This paradigm effectively utilizes 2D semantic priors during 3D generation, without external segmenters (He et al., 10 Dec 2025).
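
At inference, the linear velocity blending amounts to $v = v_{\mathrm{uncond}} + s\,(v_{\mathrm{cond}} - v_{\mathrm{uncond}})$ integrated along the flow; a minimal Euler-sampler sketch in PyTorch, where the step count, guidance scale $s$, and zeroed null condition are placeholder choices.

```python
import torch

@torch.no_grad()
def sample_with_cfg(dit, image_tokens, shape, steps=50, guidance=4.0):
    """Euler integration of the learned flow from noise (t=1) to data (t=0), blending
    conditional and unconditional velocities at every step."""
    z = torch.randn(shape)                               # Z_1: pure Gaussian noise
    null_cond = torch.zeros_like(image_tokens)           # stand-in for the dropped condition
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = torch.full((shape[0],), i / steps)
        v_cond = dit(z, t, image_tokens)
        v_uncond = dit(z, t, null_cond)
        v = v_uncond + guidance * (v_cond - v_uncond)    # linear velocity blending
        z = z - dt * v                                   # step toward t = 0 (dZ/dt = v)
    return z
```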

5. Quantitative Evaluation and Ablation Studies

On a held-out dataset of 100 shapes, UniPart demonstrates state-of-the-art performance on part-level Chamfer Distance (CD↓) and F-Score (F↑) metrics:

| Method | CD↓ | F₀.₀₅↑ | F₀.₁₀↑ |
|---|---|---|---|
| HoloPart | 0.1492 | 0.5208 | 0.7450 |
| OmniPart | 0.1453 | 0.5273 | 0.7656 |
| PartCrafter | 0.1778 | 0.4749 | 0.7120 |
| PartPacker | 0.1654 | 0.4715 | 0.7226 |
| X-Part | 0.1533 | 0.5242 | 0.7523 |
| UniPart | 0.1311 | 0.5565 | 0.8052 |

Segmentation mIoU over generated objects:

| Method | mIoU↑ |
|---|---|
| SAMesh | 0.3608 |
| PartField | 0.4167 |
| P3-SAM | 0.7046 |
| UniPart | 0.7222 |

Ablation studies reveal that removing the ncs diffusion degrades CD to ≈0.145 and F₀.₀₅ drops by ~4%; omitting local-only attention increases CD to ~0.140; skipping space embeddings results in frequent misassemblies. These findings validate the importance of the staged diffusion, dual-space modeling, and explicit token structuring (He et al., 10 Dec 2025).

6. UniPart in Multiphysics Simulation (PUM–Peridynamic Enrichment)

A structurally separate framework under the UniPart umbrella addresses variational simulation of fracture via multiscale Partition of Unity Methods:

  • The computational domain $\Omega \subset \mathbb{R}^d$ is covered by overlapping patches $\{\omega_i\}$ with partition functions $\varphi_i(x)$, $\sum_i \varphi_i(x) = 1$.
  • The global trial space is $V^{\mathrm{PU}} = \sum_i \varphi_i(x)\,V_i$ with $V_i = P_i \oplus E_i$, a local polynomial basis plus optional enrichment (Heaviside, Westergaard).
  • Fracture-prone subdomains $\Omega^{\mathrm{loc}}$ are modeled by peridynamics (PD), with strong form:

$$\rho(x)\,\ddot{u}(x,t) = \int_{B_\delta(x)} f\big(u(x',t) - u(x,t),\, x',\, x\big)\, dV_{x'} + b(x,t)$$

  • A global–local enrichment algorithm solves for $u^{\mathrm{PU}}$ over $\Omega$, hands off subdomain data to PD for local fracture evolution, and reinjects the fine-scale crack response and geometry as real-time enrichment functions.
  • This approach enables a unified variational framework accommodating linear elasticity (PU), nonlocal PD fracture, and seamless patch-wise up-/downscaling of solution accuracy (Birner et al., 2021).
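
To make the strong form concrete, here is a minimal 1D discretization of the bond-based peridynamic internal force; the grid, horizon, and constant micromodulus `c` are hypothetical, and the actual UniPart/PUM coupling layers the variational enrichment described above on top of such a local PD solve.

```python
import numpy as np

def pd_internal_force(u, x, horizon, c):
    """Discrete bond-based peridynamic force density in 1D: for each node x_i, integrate
    the pairwise force f(u(x') - u(x), x', x) over the neighborhood B_delta(x_i)."""
    n = len(x)
    dx = x[1] - x[0]                                  # uniform grid spacing (volume per node)
    force = np.zeros(n)
    for i in range(n):
        for j in range(n):
            xi = x[j] - x[i]                          # reference bond vector
            if i == j or abs(xi) > horizon:
                continue
            eta = u[j] - u[i]                         # relative displacement u(x') - u(x)
            stretch = (abs(xi + eta) - abs(xi)) / abs(xi)
            force[i] += c * stretch * np.sign(xi + eta) * dx    # f(...) dV_{x'}
    return force

# Example: a uniformly stretched 1D bar; interior forces cancel, only end effects remain
x = np.linspace(0.0, 1.0, 101)
u = 1e-3 * x                                          # 0.1% uniform stretch
f = pd_internal_force(u, x, horizon=0.03, c=1.0)
```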

7. Theoretical Analysis, Performance, and Implications

UniPart’s unified latent and solver representations yield key technical benefits:

  • The Geom-Seg VecSet demonstrates that joint geometric and segmentation encoding can be trained with no loss in geometric quality, enabling explicit part control and cross-modal conditioning without large annotated part segmentation datasets.
  • Dual-space and hierarchical diffusion strategies result in high-fidelity generation and superior part-level correspondence not achievable with single-stage or global-only methods.
  • In multiphysics simulation, global–local enrichment via PU and PD achieves optimal error rates for polynomials of degree $p$:
    • $\|u - u_h\|_{L^2} = O(h^{p+1})$, $\|u - u_h\|_{H^1} = O(h^p)$, including near singularities (e.g., cracks).
  • Numerical studies confirm that the PU and PD models match closely (maximum displacement error < 3%, up to $10^{-7}$ to $10^{-8}$ m for stationary cracks), with local PD computations offering drastic reductions in compute time relative to global PD (Birner et al., 2021).

This suggests that UniPart paradigms—whether for generative modeling or physical simulation—provide a scalable, explicit, and interpretable methodology for decomposable, part-aware computation and synthesis, eliminating reliance on monolithic black-box systems or expensive end-to-end training for segmentation and local detail.
