Geom-Seg VecSet: Latent 3D Segmentation
- Geom-Seg VecSet is a unified latent representation that encodes both geometry and part-level segmentation for fine-grained 3D object generation.
- It employs a transformer-based encoder with cross-attention and specialized decoders to predict geometry, segmentation, and latent anchor positions in one framework.
- The approach integrates dual-space latent diffusion to balance global object structure with local part details, eliminating the need for external segmentation models.
Geom-Seg VecSet is a unified latent representation for 3D point clouds and part-level segmentation, proposed for controllable, decomposable 3D object generation within latent diffusion frameworks. The method encodes both geometry and segmentation structure into a compact set of latent vectors, enabling fine-grained, promptable segmentation and shape generation without external segmenters or separate part supervision. Geom-Seg VecSet serves as the core representational and interface component of the UniPart framework for part-level 3D generation (He et al., 10 Dec 2025).
1. Formal Definition and Mathematical Formulation
For an object mesh $\mathcal{M}$ with $P$ parts, Geom-Seg VecSet begins by uniformly sampling $N$ points from the surface of $\mathcal{M}$, assembling the set
$$X = \{(x_i, n_i, s_i)\}_{i=1}^{N},$$
where $x_i \in \mathbb{R}^3$ is position, $n_i \in \mathbb{R}^3$ is the normal, and $s_i \in \{0,1\}^P$ is a one-hot encoding for the part label ($\sum_p s_{i,p} = 1$). A transformer-based encoder $\mathcal{E}$ maps $X$ to $Z = \{z_j\}_{j=1}^{L}$, where $z_j \in \mathbb{R}^C$ and $L$ is the number of latent tokens. The joint set $Z$ constitutes a Geom-Seg VecSet. Each $z_j$ reflects both geometric properties and local part membership.
Mathematically, the framework supports three decoding tasks:
- Geometry decoding for a query point $q \in \mathbb{R}^3$:
$$\hat{o}(q) = \mathcal{D}_{\mathrm{geo}}(q, Z) \approx o(q),$$
where $o(q)$ is the ground-truth implicit field.
- Segmentation decoding (promptable) for token index $j$:
$$\hat{s}_j = \mathcal{D}_{\mathrm{seg}}(z_j, e_j),$$
assigning part labels to latent tokens ($e_j$ is an index embedding).
- Latent anchor position decoding (auxiliary):
$$\hat{x}_j = \mathcal{D}_{\mathrm{pos}}(z_j),$$
estimating an anchor position for each $z_j$.
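As an illustrative sketch (numpy; `build_point_set` and its argument names are hypothetical, not from the paper), the sampled point set with one-hot part labels can be assembled as:

```python
import numpy as np

def build_point_set(positions, normals, part_ids, num_parts):
    """Assemble X = {(x_i, n_i, s_i)}: positions, normals, one-hot part labels."""
    n = positions.shape[0]
    s = np.zeros((n, num_parts), dtype=np.float32)
    s[np.arange(n), part_ids] = 1.0  # one-hot part membership
    return np.concatenate([positions, normals, s], axis=-1)  # (N, 6 + P)

# Toy example: 4 surface samples from an object with 3 parts.
pts = np.random.randn(4, 3).astype(np.float32)
nrm = np.random.randn(4, 3).astype(np.float32)
X = build_point_set(pts, nrm, np.array([0, 2, 1, 2]), num_parts=3)
print(X.shape)  # (4, 9)
```

Each row carries the full per-point tuple, so a single set-to-set encoder can consume geometry and part structure jointly.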
2. Encoder/Decoder Architecture
Encoder ($\mathcal{E}$)
- Input: $X = \{(x_i, n_i, s_i)\}_{i=1}^{N}$.
- Architecture:
  - Initial cross-attention layer with $L$ learnable queries attending over the point set.
  - Stacked blocks of multi-head self-attention (with $h$ heads and channel width $C$) and MLPs, each followed by layer normalization and residual connections.
- Output: Geom-Seg VecSet $Z = \{z_j\}_{j=1}^{L}$.
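A minimal single-head sketch of this encoder pattern in numpy, omitting multi-head projections and layer normalization (`encode`, `attention`, and the weight layout are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: (Lq, C) x (Lk, C) -> (Lq, C)."""
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return w @ v

def encode(X, queries, W_in, blocks):
    """Cross-attend L learnable queries over the point set, then run
    simplified self-attention + MLP blocks with residual connections."""
    feats = X @ W_in                                  # embed (x_i, n_i, s_i) -> C channels
    Z = queries + attention(queries, feats, feats)    # initial cross-attention
    for W1, W2 in blocks:
        Z = Z + attention(Z, Z, Z)                    # self-attention (single head)
        Z = Z + np.maximum(Z @ W1, 0) @ W2            # MLP with ReLU
    return Z                                          # Geom-Seg VecSet: (L, C)
```

The learnable queries compress a variable-size point set into a fixed number of latent tokens, which is what makes the VecSet format diffusion-friendly.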
Decoders
- Geometry Decoder ($\mathcal{D}_{\mathrm{geo}}$): 8-layer MLP (hidden width 512, ReLU) with cross-attention between latents and spatial query points.
- Segmentation Decoder ($\mathcal{D}_{\mathrm{seg}}$): based on promptable SAM2 segmentation; a transformer head takes (token, index embedding) pairs and outputs logits over part labels.
- Position Decoder ($\mathcal{D}_{\mathrm{pos}}$): 3-layer MLP trained to regress each token's anchor position $\hat{x}_j$ with a regression loss.
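A hedged sketch of the segmentation head's interface, reduced to a single linear readout (`segment_tokens`, `index_emb`, and `W_head` are hypothetical names; the actual head is a transformer):

```python
import numpy as np

def segment_tokens(Z, index_emb, W_head):
    """Per-token part logits from (token, index embedding) pairs,
    then a hard label assignment per latent token."""
    h = np.concatenate([Z, index_emb], axis=-1)  # (L, C + E)
    logits = h @ W_head                          # (L, P) logits over part labels
    return logits.argmax(axis=-1)                # part label per latent token
```

Because labels are predicted per latent token rather than per surface point, segmentation lives entirely in latent space and can be read off after diffusion.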
3. Training Objectives and Procedures
Training employs a VAE framework, with the total objective
$$\mathcal{L} = \mathcal{L}_{\mathrm{geo}} + \mathcal{L}_{\mathrm{seg}} + \beta\,\mathcal{L}_{\mathrm{KL}}.$$
- $\mathcal{L}_{\mathrm{geo}}$: squared error between predicted and ground-truth implicit fields.
- $\mathcal{L}_{\mathrm{seg}}$: cross-entropy segmentation loss, following SAM2.
- $\mathcal{L}_{\mathrm{KL}}$: KL-divergence term for latent regularization; $\beta$ controls the balance.
Pretraining typically occurs on geometry-only data; fine-tuning incorporates segmentation supervision. The design encourages the emergence of part-level structure during joint geometry encoding without external part annotations for each training step.
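The composite objective can be sketched as follows (numpy; the `beta` value and tensor shapes are illustrative, and the SAM2-style segmentation loss is reduced here to plain cross-entropy):

```python
import numpy as np

def total_loss(occ_pred, occ_gt, seg_logits, seg_gt, mu, logvar, beta=1e-3):
    """L = L_geo + L_seg + beta * L_KL  (beta value is illustrative)."""
    l_geo = np.mean((occ_pred - occ_gt) ** 2)                   # implicit-field MSE
    p = np.exp(seg_logits - seg_logits.max(-1, keepdims=True))  # stable softmax
    p /= p.sum(-1, keepdims=True)
    l_seg = -np.mean(np.log(p[np.arange(len(seg_gt)), seg_gt] + 1e-9))  # cross-entropy
    l_kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))  # Gaussian KL to N(0, I)
    return l_geo + l_seg + beta * l_kl
```

Keeping $\beta$ small is the usual VAE trade-off: enough regularization for a smooth latent space without washing out geometric detail.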
4. Integration into Two-Stage Latent Diffusion Pipeline
Geom-Seg VecSet enables latent-level control in the UniPart two-stage diffusion process:
Stage 1: Whole-object diffusion and latent segmentation
- Uses a DiT (Diffusion Transformer) backbone with rectified flow matching.
- Latent trajectory: $z_0 \sim \mathcal{N}(0, I)$, $z_1 = Z$, $z_t = (1 - t)\,z_0 + t\,z_1$.
- Training minimizes the flow-matching loss $\mathbb{E}\,\lVert v_\theta(z_t, t, I) - (z_1 - z_0)\rVert^2$, conditioned on the input image $I$.
- After diffusion, segmentation tokens are extracted via the frozen $\mathcal{D}_{\mathrm{seg}}$ and $\mathcal{D}_{\mathrm{pos}}$ for part assignment, then post-processed for mask generation.
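Under the standard rectified-flow convention (linear interpolation between a noise endpoint and the data latent), one training pair can be sampled as in this sketch (`flow_matching_pair` is a hypothetical helper):

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(z1):
    """Sample a point on the straight path z_t = (1 - t) z0 + t z1
    together with its velocity target (z1 - z0)."""
    z0 = rng.standard_normal(z1.shape)  # noise endpoint z0 ~ N(0, I)
    t = rng.uniform()                   # random time in [0, 1]
    zt = (1 - t) * z0 + t * z1          # linear (rectified-flow) interpolation
    v_target = z1 - z0                  # constant velocity along the straight path
    return zt, t, v_target
```

The DiT is then trained to regress `v_target` from `(zt, t)` plus the image condition; the straight-line paths are what make rectified flow cheap to sample at inference.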
Stage 2: Part-level diffusion with dual-space conditioning
- Each part $p$ obtains dual latents in global coordinate space (GCS) and normalized canonical space (NCS): $Z_p = (Z_p^{\mathrm{GCS}}, Z_p^{\mathrm{NCS}})$.
- A DiT predicts part flows, conditioned on the input image and the whole-object latents.
- Each transformer block fuses local (per-space) and global (cross-space) attention among part latents.
A dual-space approach enforces both global composition and localized detail preservation at the part level.
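A simplified single-head sketch of one such fusion block (numpy; the actual blocks use multi-head attention with learned projections and normalization, so this only illustrates the local-then-global attention pattern):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: (Lq, C) x (Lk, C) -> (Lq, C)."""
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return w @ v

def dual_space_block(Z_gcs, Z_ncs):
    """One fusion block: per-space (local) attention first,
    then cross-space (global) attention over the concatenated latents."""
    Z_gcs = Z_gcs + attention(Z_gcs, Z_gcs, Z_gcs)  # local: within GCS latents
    Z_ncs = Z_ncs + attention(Z_ncs, Z_ncs, Z_ncs)  # local: within NCS latents
    joint = np.concatenate([Z_gcs, Z_ncs], axis=0)  # global: across both spaces
    joint = joint + attention(joint, joint, joint)
    L = Z_gcs.shape[0]
    return joint[:L], joint[L:]
```

Local attention refines detail within each coordinate space, while the joint pass keeps the GCS placement and NCS geometry of a part mutually consistent.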
5. Sampling and Mesh Decoding Workflow
Given an input image $I$, the generative pipeline proceeds as:
- Apply whole-object latent diffusion to obtain $Z$.
- Perform latent segmentation, resulting in per-part token sets $\{Z_p\}_{p=1}^{P}$.
- For each part $p$:
  - Initialize noise $z_0^p \sim \mathcal{N}(0, I)$.
  - Denoise via the part-level DiT to yield $Z_p^{\mathrm{GCS}}$ and $Z_p^{\mathrm{NCS}}$.
  - Decode the implicit field into a mesh via $\mathcal{D}_{\mathrm{geo}}$ and marching cubes.
  - Compute the rigid transform in GCS, then apply it to the NCS mesh.
- Compose all part meshes into the final 3D object.
This approach avoids repeated lossy marching cubes re-encoding, increases consistency of part-geometry mapping, and provides direct control over part-level generation at the latent stage.
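The workflow above can be sketched as a driver function in which every stage is a stub callable (all names hypothetical; the point is the data flow, not the models):

```python
def generate(image, stage1, segment, stage2, decode_mesh, rigid_from_gcs):
    """Hypothetical driver for the two-stage sampling workflow.
    All arguments after `image` are callables standing in for trained models."""
    Z = stage1(image)                  # whole-object latent diffusion
    parts = segment(Z)                 # latent segmentation -> per-part token sets
    meshes = []
    for Zp in parts:
        Zg, Zn = stage2(image, Z, Zp)  # part-level denoising -> (GCS, NCS) dual latents
        mesh = decode_mesh(Zn)         # implicit field -> marching cubes, in NCS
        R, t = rigid_from_gcs(Zg)      # rigid transform estimated from GCS latents
        meshes.append((mesh, R, t))    # place the NCS mesh into the global frame
    return meshes
```

Note that meshing happens exactly once per part, at the very end, which is what avoids the repeated lossy marching-cubes re-encoding mentioned above.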
6. Properties, Benefits, and Limitations
Benefits
- Emergent segmentation: Part-awareness is learned “for free” through geometry-centric objectives.
- Fine-grained granularity: Latent segmentation enables precise control and adapts to varying object complexity.
- No external segmenter: The model does not require separate part-annotated data or models at inference.
- Dual-space diffusion: Maintains coherence and fidelity for both holistic object shape and localized part attributes.
- Efficient training: Domain-aligned latent conditioning enhances diffusion efficiency and couples part and whole representations.
Limitations and Open Challenges
- Currently limited to single-object synthesis; full-scene composition is not addressed.
- Relies on point-set latent format; extension to voxel or mesh-graph VecSets (e.g., SLAT) would generalize application.
- The segmentation head is fixed-size ($L$ tokens); scalability to extremely complex objects may require dynamic token allocation.
- Segmentation accuracy for thin or fine parts is not explicitly regularized; some topological degradation may occur.
- Absence of adversarial losses or explicit topological priors; part consistency relies solely on geometric and segmentation objectives.
7. Comparative Context and Applications
Geom-Seg VecSet marks a shift from prior approaches—either implicit, non-controllable part segmentation, or reliance on external semantic masks—toward a unified, latent-based geometry-segmentation paradigm with promptable control. Applications span part-level generative design, interactive 3D editing, robotic manipulation where part decomposition is crucial, and any downstream task benefiting from decomposable, high-fidelity 3D synthesis with semantic control (He et al., 10 Dec 2025).
A plausible implication is that the VecSet format, through its compositionality and promptability, offers a generalizable scaffold for integrating discrete part structure into continuous generative models, providing a pathway toward scene-level compositional synthesis and broader generative 3D understanding.