
Geom-Seg VecSet: Latent 3D Segmentation

Updated 17 December 2025
  • Geom-Seg VecSet is a unified latent representation that encodes both geometry and part-level segmentation for fine-grained 3D object generation.
  • It employs a transformer-based encoder with cross-attention and specialized decoders to predict geometry, segmentation, and latent anchor positions in one framework.
  • The approach integrates dual-space latent diffusion to balance global object structure with local part details, eliminating the need for external segmentation models.

Geom-Seg VecSet is a unified latent representation for 3D point clouds and part-level segmentation, proposed for controllable, decomposable 3D object generation within latent diffusion frameworks. The method encodes both geometry and segmentation structure into a compact set of latent vectors, enabling fine-grained, promptable segmentation and shape generation without requiring external segmenters or separate part supervision. Geom-Seg VecSet serves as the core representational and interface component of the UniPart framework for part-level 3D generation (He et al., 10 Dec 2025).

1. Formal Definition and Mathematical Formulation

For an object mesh $\mathcal{O}$ with $N$ parts, Geom-Seg VecSet begins by uniformly sampling $C$ points from $\mathcal{O}$, assembling the set

$$P = \left\{ (x_k, n_k, s_k) \in \mathbb{R}^7 \right\}_{k=1}^{C}$$

where $x_k \in \mathbb{R}^3$ is the position, $n_k \in \mathbb{R}^3$ is the normal, and $s_k$ is a one-hot encoding of the part label ($1, \ldots, N$). A transformer-based encoder $\mathcal{E}$ maps $P$ to $Z = \{z_i\}_{i=1}^{L}$, where $z_i \in \mathbb{R}^d$ and $L$ is the number of latent tokens. The joint set $Z \in \mathbb{R}^{L \times d}$ constitutes a Geom-Seg VecSet. Each $z_i$ reflects both geometric properties and local part membership.
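To make the input format concrete, here is a minimal PyTorch sketch that assembles such a point set. The sampling routine and sizes are illustrative assumptions, and the label is stored as a single scalar channel so that each point lives in $\mathbb{R}^7$; the paper's exact label encoding may differ.

```python
import torch
import torch.nn.functional as F

C, N = 2048, 8  # number of sampled points and parts (illustrative values)

# Hypothetical per-point attributes sampled from the mesh surface.
x = torch.rand(C, 3)                        # positions x_k in R^3
n = F.normalize(torch.randn(C, 3), dim=-1)  # unit normals n_k in R^3
s = torch.randint(0, N, (C, 1)).float()     # part label as one channel

P = torch.cat([x, n, s], dim=-1)            # point set P in R^{C x 7}
assert P.shape == (C, 7)
```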

Mathematically, the framework supports three decoding tasks (a shape-level code sketch follows the list):

  • Geometry decoding for query point $q$:

$$\hat f(q) = \mathcal{D}_{\rm geom}(Z, q) \approx f(q)$$

where $f(q)$ is the ground-truth implicit field.

  • Segmentation decoding (promptable) for token index $r$:

$$\hat m = \mathcal{D}_{\rm seg}(Z, r) \in \{1, \ldots, N\}^L$$

assigning part labels to latent tokens.

  • Latent anchor position decoding (auxiliary):

$$p_i^{\rm latent} = \mathcal{D}_{\rm pos}(z_i) \in \mathbb{R}^3$$

estimating an anchor position for each $z_i$.
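Read as code, the three tasks have the following input/output contracts. The stubs below return random values and exist only to pin down shapes under assumed sizes $L$, $d$, $N$; they are not the trained decoders.

```python
import torch

L, d, N = 512, 64, 8     # latent tokens, channel width, parts (assumed)
Z = torch.randn(L, d)    # a Geom-Seg VecSet

def decode_geometry(Z, q):
    """D_geom: latents + query point q -> implicit field estimate f_hat(q)."""
    return torch.randn(())                # stub: scalar field value

def decode_segmentation(Z, r):
    """D_seg: latents + prompt token index r -> a part label per token."""
    return torch.randint(0, N, (L,))      # stub: m_hat in {1..N}^L (0-indexed)

def decode_positions(Z):
    """D_pos: each latent z_i -> anchor position p_i^latent in R^3."""
    return torch.randn(Z.shape[0], 3)     # stub

f_hat = decode_geometry(Z, torch.rand(3))   # scalar
m_hat = decode_segmentation(Z, r=0)         # (L,)
p_lat = decode_positions(Z)                 # (L, 3)
```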

2. Encoder/Decoder Architecture

Encoder ($\mathcal{E}$)

  • Input: $P \in \mathbb{R}^{C \times 7}$.
  • Architecture:
    • Initial cross-attention layer with $L$ learnable queries ($Q \in \mathbb{R}^{L \times d}$) attending over the point set.
    • $K$ stacked blocks of multi-head self-attention ($h$ heads, channel width $d$) and MLPs, each followed by layer normalization and residual connections.
  • Output: Geom-Seg VecSet $Z \in \mathbb{R}^{L \times d}$; a minimal encoder sketch follows.
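The sketch below renders this layout in PyTorch; the sizes $L$, $d$, $K$, $h$ and details such as pre-norm placement are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GeomSegEncoder(nn.Module):
    """Sketch of the encoder E: L learnable queries cross-attend to the
    point set, then K self-attention blocks refine the latent tokens."""
    def __init__(self, L=512, d=64, K=8, h=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(L, d))
        self.in_proj = nn.Linear(7, d)      # lift (x_k, n_k, s_k) to width d
        self.cross = nn.MultiheadAttention(d, h, batch_first=True)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d, h, dim_feedforward=4 * d,
                                       batch_first=True, norm_first=True)
            for _ in range(K))

    def forward(self, P):                   # P: (B, C, 7)
        feats = self.in_proj(P)             # (B, C, d)
        q = self.queries.expand(P.shape[0], -1, -1)
        Z, _ = self.cross(q, feats, feats)  # initial cross-attention
        for blk in self.blocks:             # K self-attention + MLP blocks
            Z = blk(Z)
        return Z                            # Geom-Seg VecSet, (B, L, d)

Z = GeomSegEncoder()(torch.rand(2, 2048, 7))  # -> shape (2, 512, 64)
```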

Decoders

  • Geometry Decoder ($\mathcal{D}_{\rm geom}$): 8-layer MLP (hidden width 512, ReLU) with cross-attention between the latents and spatial query points, as sketched after this list.
  • Segmentation Decoder ($\mathcal{D}_{\rm seg}$): modeled on promptable SAM2 segmentation; a transformer head takes a (token, index embedding) pair and outputs logits over part labels.
  • Position Decoder ($\mathcal{D}_{\rm pos}$): 3-layer MLP trained to regress the anchor point via an $L_2$ loss.
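As one concrete reading of the geometry decoder, the sketch below uses the stated hyperparameters (8-layer MLP, width 512, ReLU) with a single cross-attention from embedded query points into the latents; everything beyond the stated numbers is an assumption.

```python
import torch
import torch.nn as nn

class GeometryDecoder(nn.Module):
    """Sketch of D_geom: query points attend into Z, then an 8-layer
    ReLU MLP of width 512 regresses the implicit field value."""
    def __init__(self, d=64, width=512, depth=8, h=8):
        super().__init__()
        self.q_embed = nn.Linear(3, d)               # embed query position
        self.cross = nn.MultiheadAttention(d, h, batch_first=True)
        layers = [nn.Linear(d, width), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Linear(width, width), nn.ReLU()]
        layers.append(nn.Linear(width, 1))           # scalar f_hat(q)
        self.mlp = nn.Sequential(*layers)

    def forward(self, Z, q):                         # Z: (B, L, d), q: (B, M, 3)
        attn, _ = self.cross(self.q_embed(q), Z, Z)  # queries attend to latents
        return self.mlp(attn).squeeze(-1)            # (B, M) field values

f_hat = GeometryDecoder()(torch.randn(2, 512, 64), torch.rand(2, 4096, 3))
```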

3. Training Objectives and Procedures

Training employs a VAE framework with the total objective

$$\mathcal{L} = \mathcal{L}_{\rm recon} + \mathcal{L}_{\rm seg} + \lambda_{\rm kl}\, \mathcal{L}_{\rm kl}$$

  • $\mathcal{L}_{\rm recon}$: squared error between the predicted and ground-truth implicit fields.
  • $\mathcal{L}_{\rm seg}$: cross-entropy segmentation loss, following SAM2.
  • $\mathcal{L}_{\rm kl}$: KL-divergence term for latent regularization; $\lambda_{\rm kl}$ controls the balance.

Pretraining typically occurs on geometry-only data; fine-tuning incorporates segmentation supervision. The design encourages part-level structure to emerge during joint geometry encoding, without requiring external part annotations at every training step.
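Assembled as code, the total objective reads as follows; the diagonal-Gaussian posterior parameterization and the default $\lambda_{\rm kl}$ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(f_hat, f_gt, seg_logits, seg_gt, mu, logvar, lambda_kl=1e-3):
    """Sketch of L = L_recon + L_seg + lambda_kl * L_kl."""
    recon = F.mse_loss(f_hat, f_gt)            # squared error on implicit field
    seg = F.cross_entropy(seg_logits, seg_gt)  # cross-entropy over part labels
    # KL(q(z|x) || N(0, I)) for an assumed diagonal-Gaussian posterior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + seg + lambda_kl * kl
```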

4. Integration into Two-Stage Latent Diffusion Pipeline

Geom-Seg VecSet enables latent-level control in the UniPart two-stage diffusion process:

Stage 1: Whole-object diffusion and latent segmentation

  • Uses a DiT (Diffusion Transformer) backbone with rectified flow matching.
  • Latent trajectory: $Z_t = (1 - t) Z_0 + t\epsilon$, with $t \in [0, 1]$ and $\epsilon \sim \mathcal{N}(0, I)$.
  • Training minimizes a flow-matching loss conditioned on the input image (sketched below).
  • After diffusion, segmentation tokens are extracted via the frozen $\mathcal{D}_{\rm seg}$ and $\mathcal{D}_{\rm pos}$ for part assignment, then post-processed for mask generation.
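A sketch of the Stage 1 training step under this parameterization; `dit` is a placeholder name for the image-conditioned DiT backbone.

```python
import torch

def flow_matching_loss(dit, Z0, image_cond):
    """Rectified-flow matching sketch: along Z_t = (1 - t) Z_0 + t * eps,
    the velocity dZ_t/dt = eps - Z_0 is the regression target."""
    B = Z0.shape[0]
    t = torch.rand(B, 1, 1)            # t ~ U[0, 1], broadcast over (L, d)
    eps = torch.randn_like(Z0)         # eps ~ N(0, I)
    Zt = (1 - t) * Z0 + t * eps        # point on the latent trajectory
    v_pred = dit(Zt, t.view(B), image_cond)
    return torch.mean((v_pred - (eps - Z0)) ** 2)
```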

Stage 2: Part-level diffusion with dual-space conditioning

  • Each part $i$ obtains dual latents in global coordinate space (GCS) and normalized canonical space (NCS):

$$X_i^* = (X_i^{\rm gcs}, X_i^{\rm ncs}) \in \mathbb{R}^{L \times 2d}$$

  • A DiT $\omega$ predicts part flows, conditioned on $(I, \hat{Z}_0, X_i)$.
  • Each transformer block fuses local (per-space) and global (cross-space) attention among part latents.

This dual-space approach enforces both global composition and localized detail preservation at the part level.
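In code, the dual latent for a part is the channel-wise pairing of its two coordinate-space encodings (shapes assumed as above):

```python
import torch

L, d = 512, 64
X_gcs = torch.randn(L, d)   # part latents in global coordinate space
X_ncs = torch.randn(L, d)   # the same part in normalized canonical space

X_star = torch.cat([X_gcs, X_ncs], dim=-1)  # X_i^* in R^{L x 2d}
assert X_star.shape == (L, 2 * d)
```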

5. Sampling and Mesh Decoding Workflow

Given an input image II, the generative pipeline proceeds as:

  1. Apply whole-object latent diffusion to obtain $\hat{Z}_0$.
  2. Perform latent segmentation, resulting in $N$ part-latent sets $\{X_i\}$.
  3. For each part $i$:
    • Initialize noise $X_{i,1}^* \sim \mathcal{N}(0, I)$.
    • Denoise via the part-level DiT to yield $X_i^{\rm gcs}$ and $X_i^{\rm ncs}$.
    • Decode the implicit fields into meshes $\mathcal{M}_i^{\rm gcs}(q)$ and $\mathcal{M}_i^{\rm ncs}(q)$.
    • Compute the rigid transform $T_i$ in GCS, then apply it to the NCS mesh.
  4. Compose all part meshes into the final 3D object.

This approach avoids repeated, lossy marching-cubes re-encoding, increases the consistency of the part–geometry mapping, and provides direct control over part-level generation at the latent stage.
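Put together, the workflow reads as below. Every callable is a placeholder for a trained module, the Euler integration of the flow (from $t = 1$ to $t = 0$) is an illustrative choice, the anchor positions from $\mathcal{D}_{\rm pos}$ are omitted, and the rigid-transform estimation is elided.

```python
import torch

@torch.no_grad()
def generate(image, whole_dit, part_dit, seg_dec, decode_mesh,
             L=512, d=64, steps=50):
    # 1. Whole-object rectified-flow sampling: noise -> Z_hat_0.
    Z = torch.randn(1, L, d)
    for k in range(steps):
        t = 1.0 - k / steps
        Z = Z - (1.0 / steps) * whole_dit(Z, t, image)

    # 2. Latent segmentation: group tokens into per-part sets {X_i}.
    labels = seg_dec(Z)                      # (L,) part label per token
    parts = [Z[:, labels == i] for i in labels.unique()]

    meshes = []
    for X_i in parts:
        # 3. Part-level dual-space diffusion in R^{L_i x 2d}.
        X = torch.randn(1, X_i.shape[1], 2 * d)
        for k in range(steps):
            t = 1.0 - k / steps
            X = X - (1.0 / steps) * part_dit(X, t, image, Z, X_i)
        X_gcs, X_ncs = X.chunk(2, dim=-1)
        mesh_gcs, mesh_ncs = decode_mesh(X_gcs), decode_mesh(X_ncs)
        T_i = ...  # rigid transform estimated from mesh_gcs (elided)
        meshes.append((mesh_ncs, T_i))       # 4. apply T_i, then compose parts
    return meshes
```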

6. Properties, Benefits, and Limitations

Benefits

  • Emergent segmentation: Part-awareness is learned “for free” through geometry-centric objectives.
  • Fine-grained granularity: Latent segmentation enables precise control and adapts to varying object complexity.
  • No external segmenter: The model does not require separate part-annotated data or models at inference.
  • Dual-space diffusion: Maintains coherence and fidelity for both holistic object shape and localized part attributes.
  • Efficient training: Domain-aligned latent conditioning improves diffusion efficiency and tightly couples part–whole relations.

Limitations and Open Challenges

  • Currently limited to single-object synthesis; full-scene composition is not addressed.
  • Relies on a point-set latent format; extending to voxel or mesh-graph VecSets (e.g., SLAT) would broaden applicability.
  • The segmentation head is fixed-size ($L$ tokens); scalability to extremely complex objects may require dynamic token allocation.
  • Segmentation accuracy for thin or fine parts is not explicitly regularized; some topological degradation may occur.
  • Absence of adversarial losses or explicit topological priors; part consistency relies solely on geometric and segmentation objectives.

7. Comparative Context and Applications

Geom-Seg VecSet marks a shift from prior approaches, which either performed implicit, non-controllable part segmentation or relied on external semantic masks, toward a unified, latent-based geometry-segmentation paradigm with promptable control. Applications span part-level generative design, interactive 3D editing, robotic manipulation where part decomposition is crucial, and any downstream task benefiting from decomposable, high-fidelity 3D synthesis with semantic control (He et al., 10 Dec 2025).

A plausible implication is that the VecSet format, through its compositionality and promptability, offers a generalizable scaffold for integrating discrete part structure into continuous generative models, providing a pathway toward scene-level compositional synthesis and broader generative 3D understanding.
