Geom-Seg VecSet: Latent 3D Segmentation
- Geom-Seg VecSet is a unified latent representation that encodes both geometry and part-level segmentation for fine-grained 3D object generation.
- It employs a transformer-based encoder with cross-attention and specialized decoders to predict geometry, segmentation, and latent anchor positions in one framework.
- The approach integrates dual-space latent diffusion to balance global object structure with local part details, eliminating the need for external segmentation models.
Geom-Seg VecSet is a unified latent representation for 3D point clouds and part-level segmentation, proposed for controllable, decomposable 3D object generation within latent diffusion frameworks. The method encodes both geometry and segmentation structure into a compact set of latent vectors, enabling fine-grained, promptable segmentation and shape generation without external segmenters or separate part supervision. Geom-Seg VecSet serves as the core representational and interface component of the UniPart framework for part-level 3D generation (He et al., 10 Dec 2025).
1. Formal Definition and Mathematical Formulation
For an object mesh $\mathcal{M}$ with $P$ parts, Geom-Seg VecSet begins by uniformly sampling $N$ points from the surface of $\mathcal{M}$, assembling the set
$$X = \{(x_i, n_i, s_i)\}_{i=1}^{N},$$
where $x_i \in \mathbb{R}^3$ is position, $n_i \in \mathbb{R}^3$ is the normal, and $s_i \in \{0,1\}^P$ is a one-hot encoding for the part label ($\sum_p s_{i,p} = 1$). A transformer-based encoder $\mathcal{E}$ maps $X$ to $Z = \{z_j\}_{j=1}^{L}$, where $z_j \in \mathbb{R}^C$ and $L$ is the number of latent tokens. The joint set $Z$ constitutes a Geom-Seg VecSet. Each $z_j$ reflects both geometric properties and local part membership.
Mathematically, the framework supports three decoding tasks:
- Geometry decoding for a query point $q \in \mathbb{R}^3$:
$$\hat{o}(q) = \mathcal{D}_{\mathrm{geo}}(q, Z) \approx o(q),$$
where $o(q)$ is the ground-truth implicit field.
- Segmentation decoding (promptable) for token index $j$:
$$\hat{s}_j = \mathcal{D}_{\mathrm{seg}}(z_j, e_j),$$
assigning part labels to latent tokens ($e_j$ is an index embedding).
- Latent anchor position decoding (auxiliary):
$$\hat{x}_j = \mathcal{D}_{\mathrm{pos}}(z_j),$$
estimating an anchor position for each $z_j$.
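As an illustrative sketch (numpy; `build_point_set` and its argument names are hypothetical, not from the paper), the sampled point set with one-hot part labels can be assembled as:

```python
import numpy as np

def build_point_set(positions, normals, part_ids, num_parts):
    """Assemble X = {(x_i, n_i, s_i)}: positions, normals, one-hot part labels."""
    n = positions.shape[0]
    s = np.zeros((n, num_parts), dtype=np.float32)
    s[np.arange(n), part_ids] = 1.0  # one-hot part membership
    return np.concatenate([positions, normals, s], axis=-1)  # (N, 6 + P)

# Toy example: 4 surface samples from an object with 3 parts.
pts = np.random.randn(4, 3).astype(np.float32)
nrm = np.random.randn(4, 3).astype(np.float32)
X = build_point_set(pts, nrm, np.array([0, 2, 1, 2]), num_parts=3)
print(X.shape)  # (4, 9)
```

Each row carries the full per-point tuple, so a single set-to-set encoder can consume geometry and part structure jointly.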
2. Encoder/Decoder Architecture
Encoder ($\mathcal{E}$)
- Input: $X = \{(x_i, n_i, s_i)\}_{i=1}^{N}$.
- Architecture:
  - Initial cross-attention layer with $L$ learnable queries attending over the point set.
  - Stacked blocks of multi-head self-attention (with $h$ heads and channel width $C$) and MLPs, each followed by layer normalization and residual connections.
- Output: Geom-Seg VecSet $Z = \{z_j\}_{j=1}^{L}$.
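A minimal single-head sketch of this encoder pattern in numpy, omitting multi-head projections and layer normalization (`encode`, `attention`, and the weight layout are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: (Lq, C) x (Lk, C) -> (Lq, C)."""
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return w @ v

def encode(X, queries, W_in, blocks):
    """Cross-attend L learnable queries over the point set, then run
    simplified self-attention + MLP blocks with residual connections."""
    feats = X @ W_in                                  # embed (x_i, n_i, s_i) -> C channels
    Z = queries + attention(queries, feats, feats)    # initial cross-attention
    for W1, W2 in blocks:
        Z = Z + attention(Z, Z, Z)                    # self-attention (single head)
        Z = Z + np.maximum(Z @ W1, 0) @ W2            # MLP with ReLU
    return Z                                          # Geom-Seg VecSet: (L, C)
```

The learnable queries compress a variable-size point set into a fixed number of latent tokens, which is what makes the VecSet format diffusion-friendly.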
Decoders
- Geometry Decoder ($\mathcal{D}_{\mathrm{geo}}$): 8-layer MLP (hidden width 512, ReLU) with cross-attention between latents and spatial query points.
- Segmentation Decoder ($\mathcal{D}_{\mathrm{seg}}$): based on promptable SAM2 segmentation; a transformer head takes (token, index embedding) pairs and outputs logits over part labels.
- Position Decoder ($\mathcal{D}_{\mathrm{pos}}$): 3-layer MLP trained to regress each token's anchor position $\hat{x}_j$ with a regression loss.
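A hedged sketch of the segmentation head's interface, reduced to a single linear readout (`segment_tokens`, `index_emb`, and `W_head` are hypothetical names; the actual head is a transformer):

```python
import numpy as np

def segment_tokens(Z, index_emb, W_head):
    """Per-token part logits from (token, index embedding) pairs,
    then a hard label assignment per latent token."""
    h = np.concatenate([Z, index_emb], axis=-1)  # (L, C + E)
    logits = h @ W_head                          # (L, P) logits over part labels
    return logits.argmax(axis=-1)                # part label per latent token
```

Because labels are predicted per latent token rather than per surface point, segmentation lives entirely in latent space and can be read off after diffusion.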
3. Training Objectives and Procedures
Training employs a VAE framework, with the total objective
$$\mathcal{L} = \mathcal{L}_{\mathrm{geo}} + \mathcal{L}_{\mathrm{seg}} + \beta\,\mathcal{L}_{\mathrm{KL}}.$$
- $\mathcal{L}_{\mathrm{geo}}$: squared error between predicted and ground-truth implicit fields.
- $\mathcal{L}_{\mathrm{seg}}$: cross-entropy segmentation loss, following SAM2.
- $\mathcal{L}_{\mathrm{KL}}$: KL-divergence term for latent regularization; $\beta$ controls the balance.
Pretraining typically occurs on geometry-only data; fine-tuning incorporates segmentation supervision. The design encourages the emergence of part-level structure during joint geometry encoding without external part annotations for each training step.
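The composite objective can be sketched as follows (numpy; the `beta` value and tensor shapes are illustrative, and the SAM2-style segmentation loss is reduced here to plain cross-entropy):

```python
import numpy as np

def total_loss(occ_pred, occ_gt, seg_logits, seg_gt, mu, logvar, beta=1e-3):
    """L = L_geo + L_seg + beta * L_KL  (beta value is illustrative)."""
    l_geo = np.mean((occ_pred - occ_gt) ** 2)                   # implicit-field MSE
    p = np.exp(seg_logits - seg_logits.max(-1, keepdims=True))  # stable softmax
    p /= p.sum(-1, keepdims=True)
    l_seg = -np.mean(np.log(p[np.arange(len(seg_gt)), seg_gt] + 1e-9))  # cross-entropy
    l_kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))  # Gaussian KL to N(0, I)
    return l_geo + l_seg + beta * l_kl
```

Keeping $\beta$ small is the usual VAE trade-off: enough regularization for a smooth latent space without washing out geometric detail.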
4. Integration into Two-Stage Latent Diffusion Pipeline
Geom-Seg VecSet enables latent-level control in the UniPart two-stage diffusion process:
Stage 1: Whole-object diffusion and latent segmentation
- Uses a DiT (Diffusion Transformer) backbone with rectified flow matching.
- Latent trajectory: $z_0 \sim \mathcal{N}(0, I)$, $z_1 = Z$, $z_t = (1 - t)\,z_0 + t\,z_1$.
- Training minimizes the flow-matching loss $\mathbb{E}\,\lVert v_\theta(z_t, t, I) - (z_1 - z_0)\rVert^2$, conditioned on the input image $I$.
- After diffusion, segmentation tokens are extracted via the frozen $\mathcal{D}_{\mathrm{seg}}$ and $\mathcal{D}_{\mathrm{pos}}$ for part assignment, then post-processed for mask generation.
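Under the standard rectified-flow convention (linear interpolation between a noise endpoint and the data latent), one training pair can be sampled as in this sketch (`flow_matching_pair` is a hypothetical helper):

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(z1):
    """Sample a point on the straight path z_t = (1 - t) z0 + t z1
    together with its velocity target (z1 - z0)."""
    z0 = rng.standard_normal(z1.shape)  # noise endpoint z0 ~ N(0, I)
    t = rng.uniform()                   # random time in [0, 1]
    zt = (1 - t) * z0 + t * z1          # linear (rectified-flow) interpolation
    v_target = z1 - z0                  # constant velocity along the straight path
    return zt, t, v_target
```

The DiT is then trained to regress `v_target` from `(zt, t)` plus the image condition; the straight-line paths are what make rectified flow cheap to sample at inference.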
Stage 2: Part-level diffusion with dual-space conditioning
- Each part $p$ obtains dual latents in global coordinate space (GCS) and normalized canonical space (NCS): $Z_p = (Z_p^{\mathrm{GCS}}, Z_p^{\mathrm{NCS}})$.
- A DiT predicts part flows, conditioned on the input image and the whole-object latents.
- Each transformer block fuses local (per-space) and global (cross-space) attention among part latents.
A dual-space approach enforces both global composition and localized detail preservation at the part level.
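A simplified single-head sketch of one such fusion block (numpy; the actual blocks use multi-head attention with learned projections and normalization, so this only illustrates the local-then-global attention pattern):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: (Lq, C) x (Lk, C) -> (Lq, C)."""
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return w @ v

def dual_space_block(Z_gcs, Z_ncs):
    """One fusion block: per-space (local) attention first,
    then cross-space (global) attention over the concatenated latents."""
    Z_gcs = Z_gcs + attention(Z_gcs, Z_gcs, Z_gcs)  # local: within GCS latents
    Z_ncs = Z_ncs + attention(Z_ncs, Z_ncs, Z_ncs)  # local: within NCS latents
    joint = np.concatenate([Z_gcs, Z_ncs], axis=0)  # global: across both spaces
    joint = joint + attention(joint, joint, joint)
    L = Z_gcs.shape[0]
    return joint[:L], joint[L:]
```

Local attention refines detail within each coordinate space, while the joint pass keeps the GCS placement and NCS geometry of a part mutually consistent.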
5. Sampling and Mesh Decoding Workflow
Given an input image $I$, the generative pipeline proceeds as:
- Apply whole-object latent diffusion to obtain $Z$.
- Perform latent segmentation, resulting in per-part token sets $\{Z_p\}_{p=1}^{P}$.
- For each part $p$:
  - Initialize noise $z_0^p \sim \mathcal{N}(0, I)$.
  - Denoise via the part-level DiT to yield $Z_p^{\mathrm{GCS}}$ and $Z_p^{\mathrm{NCS}}$.
  - Decode the implicit field into a mesh via $\mathcal{D}_{\mathrm{geo}}$ and marching cubes.
  - Compute the rigid transform in GCS, then apply it to the NCS mesh.
- Compose all part meshes into the final 3D object.
This approach avoids repeated lossy marching cubes re-encoding, increases consistency of part-geometry mapping, and provides direct control over part-level generation at the latent stage.
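The workflow above can be sketched as a driver function in which every stage is a stub callable (all names hypothetical; the point is the data flow, not the models):

```python
def generate(image, stage1, segment, stage2, decode_mesh, rigid_from_gcs):
    """Hypothetical driver for the two-stage sampling workflow.
    All arguments after `image` are callables standing in for trained models."""
    Z = stage1(image)                  # whole-object latent diffusion
    parts = segment(Z)                 # latent segmentation -> per-part token sets
    meshes = []
    for Zp in parts:
        Zg, Zn = stage2(image, Z, Zp)  # part-level denoising -> (GCS, NCS) dual latents
        mesh = decode_mesh(Zn)         # implicit field -> marching cubes, in NCS
        R, t = rigid_from_gcs(Zg)      # rigid transform estimated from GCS latents
        meshes.append((mesh, R, t))    # place the NCS mesh into the global frame
    return meshes
```

Note that meshing happens exactly once per part, at the very end, which is what avoids the repeated lossy marching-cubes re-encoding mentioned above.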
6. Properties, Benefits, and Limitations
Benefits
- Emergent segmentation: Part-awareness is learned “for free” through geometry-centric objectives.
- Fine-grained granularity: Latent segmentation enables precise control and adapts to varying object complexity.
- No external segmenter: The model does not require separate part-annotated data or models at inference.
- Dual-space diffusion: Maintains coherence and fidelity for both holistic object shape and localized part attributes.
- Efficient training: Domain-aligned latent conditioning enhances diffusion efficiency and couples part and whole representations.
Limitations and Open Challenges
- Currently limited to single-object synthesis; full-scene composition is not addressed.
- Relies on point-set latent format; extension to voxel or mesh-graph VecSets (e.g., SLAT) would generalize application.
- The segmentation head is fixed-size ($L$ tokens); scalability to extremely complex objects may require dynamic token allocation.
- Segmentation accuracy for thin or fine parts is not explicitly regularized; some topological degradation may occur.
- Absence of adversarial losses or explicit topological priors; part consistency relies solely on geometric and segmentation objectives.
7. Comparative Context and Applications
Geom-Seg VecSet marks a shift from prior approaches—either implicit, non-controllable part segmentation, or reliance on external semantic masks—toward a unified, latent-based geometry-segmentation paradigm with promptable control. Applications span part-level generative design, interactive 3D editing, robotic manipulation where part decomposition is crucial, and any downstream task benefiting from decomposable, high-fidelity 3D synthesis with semantic control (He et al., 10 Dec 2025).
A plausible implication is that the VecSet format, through its compositionality and promptability, offers a generalizable scaffold for integrating discrete part structure into continuous generative models, providing a pathway toward scene-level compositional synthesis and broader generative 3D understanding.