- The paper presents a novel diffusion-based architecture that leverages bounding box prompts and point-wise semantic features to achieve high-fidelity, semantically accurate 3D shape decomposition.
- It employs a synchronized multi-part diffusion framework with intra- and inter-part attention mechanisms to enhance geometric fidelity and boundary coherence.
- The approach outperforms existing methods in Chamfer Distance and F-Score metrics while supporting interactive part editing for applications such as mesh retopology and UV mapping.
High-Fidelity and Structure-Coherent Shape Decomposition with X-Part
Introduction
The paper "X-Part: high fidelity and structure coherent shape decomposition" (2509.08643) presents a diffusion-based framework for decomposing holistic 3D objects into semantically meaningful and structurally coherent parts, with a focus on geometric fidelity and controllability. The method addresses limitations in prior part-based 3D generative models, which often suffer from poor semantic decomposition, ambiguous boundaries, and limited user control. X-Part leverages bounding box prompts and point-wise semantic features to guide the decomposition process, enabling interactive part editing and supporting downstream applications such as mesh retopology and UV mapping.
Figure 1: Qualitative results of X-Part demonstrating high-fidelity part decomposition and geometric coherence across diverse 3D assets.
Methodology
Vecset-Based Latent Diffusion Framework
X-Part builds upon a vecset-based 3D latent diffusion architecture, utilizing a transformer-based VAE for encoding point clouds into latent tokens. The encoder processes sharp-edge-aware sampled point clouds, mapping them into a latent space suitable for part-level geometry representation. The decoder reconstructs signed distance fields (SDFs) from latent tokens, enabling high-fidelity mesh generation. The VAE is fine-tuned on a large-scale part-level dataset to enhance its capacity for part-wise geometry encoding.
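As a reminder of what the decoder's output representation looks like, the following minimal sketch evaluates an analytic signed distance field and samples it on a coarse grid. It is not the paper's decoder: the sphere shape, the function names, the grid resolution, and the "negative inside" sign convention are all assumptions made for illustration.

```python
import math

def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside.

    The sign convention is an assumption; the summary does not state which one the paper uses.
    """
    return math.dist(point, center) - radius

def occupancy_grid(sdf, resolution=8, extent=1.5):
    """Sample `sdf` on a regular grid over [-extent, extent]^3.

    Returns a nested list where grid[i][j][k] is True when the sample lies
    inside or on the surface (sdf <= 0).
    """
    step = 2.0 * extent / (resolution - 1)
    coords = [-extent + i * step for i in range(resolution)]
    return [[[sdf((x, y, z)) <= 0.0 for z in coords] for y in coords] for x in coords]
```

A mesh is typically extracted from the zero level set of such a field, for example with marching cubes; that step is omitted here.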
Semantic-Aware Conditioning and Bounding Box Prompts
A key innovation is the use of bounding boxes as decomposition prompts, providing coarse spatial guidance for part location and scale. This mitigates overfitting to segmentation masks and improves robustness to segmentation inaccuracies. Point-wise semantic features extracted from P3-SAM are concatenated with shape tokens, supplying rich semantic cues for decomposition. During training, bounding boxes are randomly perturbed to further enhance robustness.
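The random perturbation of bounding boxes during training could be sketched as follows. The summary does not specify the perturbation scheme, so the per-axis scale-and-shift jitter, the function name `perturb_bbox`, and the magnitudes `max_shift` and `max_scale` are hypothetical choices, not values from the paper.

```python
import random

def perturb_bbox(bbox_min, bbox_max, max_shift=0.05, max_scale=0.1, rng=random):
    """Jitter an axis-aligned box with a random per-axis rescale and shift.

    max_shift and max_scale are illustrative fractions (of the box extent and
    half-extent respectively); they are assumptions, not values from the paper.
    """
    new_min, new_max = [], []
    for lo, hi in zip(bbox_min, bbox_max):
        extent = hi - lo
        center = 0.5 * (lo + hi)
        half = 0.5 * extent
        # Rescale the half-extent about the box center.
        half *= 1.0 + rng.uniform(-max_scale, max_scale)
        # Shift the center by a fraction of the original extent.
        center += rng.uniform(-max_shift, max_shift) * extent
        new_min.append(center - half)
        new_max.append(center + half)
    return new_min, new_max
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which is convenient when debugging a data pipeline.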
Synchronized Multi-Part Diffusion
X-Part generates latent tokens for all object parts in a synchronized fashion, employing intra-part and inter-part self-attention mechanisms. Intra-part attention captures local geometric context, while inter-part attention extends the receptive field to all part tokens, improving boundary coherence and global structure. Cross-attention layers inject both object-level and part-level conditions, preserving geometric details and structural consistency. A learnable part embedding codebook introduces distinctiveness among part latents, supporting decomposition of objects with a large number of parts.
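One way to picture the difference between the two self-attention modes is as boolean attention masks over the concatenated part tokens. This is only a sketch: whether the paper realizes the two modes with masks, tensor reshaping, or separate layers is not stated in this summary, and `build_attention_masks` is a hypothetical helper.

```python
def build_attention_masks(num_parts, tokens_per_part):
    """Return (intra_mask, inter_mask) as nested lists of booleans.

    mask[i][j] is True when token i may attend to token j. Tokens are assumed
    to be laid out part by part: part 0's tokens first, then part 1's, etc.
    """
    total = num_parts * tokens_per_part
    part_of = [i // tokens_per_part for i in range(total)]
    # Intra-part: a token only sees tokens belonging to the same part.
    intra = [[part_of[i] == part_of[j] for j in range(total)] for i in range(total)]
    # Inter-part: every token sees the tokens of all parts.
    inter = [[True] * total for _ in range(total)]
    return intra, inter
```

With 3 parts of 2 tokens each, the intra-part mask is block-diagonal with three 2x2 blocks, while the inter-part mask is fully dense.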
Figure 2: Architecture of X-Part, illustrating the flow from input point cloud and bounding box prompts through semantic feature extraction, multi-part latent diffusion, and part-wise mesh generation.
Editable Part Generation Pipeline
The framework supports interactive part editing via bounding box manipulation. Users can split, merge, or adjust bounding boxes to control part decomposition, with the diffusion process resampling and denoising affected part latents. This enables intuitive, production-ready editing workflows for 3D asset creation.
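The box-level bookkeeping behind such edits might look like the sketch below. The helpers `merge_boxes`, `split_box`, and `apply_edit` are hypothetical, boxes are assumed to be axis-aligned `(min_corner, max_corner)` pairs, and the rule that only newly created boxes are flagged for resampling is an assumption; the diffusion resampling itself is not shown.

```python
def merge_boxes(a, b):
    """Union of two axis-aligned boxes, each given as (min_corner, max_corner)."""
    (amin, amax), (bmin, bmax) = a, b
    return ([min(x, y) for x, y in zip(amin, bmin)],
            [max(x, y) for x, y in zip(amax, bmax)])

def split_box(box, axis, position):
    """Cut a box in two with a plane perpendicular to `axis` at `position`."""
    bmin, bmax = box
    if not bmin[axis] < position < bmax[axis]:
        raise ValueError("split position must lie strictly inside the box")
    left_max = list(bmax)
    left_max[axis] = position
    right_min = list(bmin)
    right_min[axis] = position
    return (list(bmin), left_max), (right_min, list(bmax))

def apply_edit(boxes, edited_indices, new_boxes):
    """Replace the boxes at `edited_indices` with `new_boxes`.

    Returns the updated box list and the indices of the parts whose latents
    would need to be resampled (here: only the newly added boxes).
    """
    edited = set(edited_indices)
    kept = [b for i, b in enumerate(boxes) if i not in edited]
    result = kept + list(new_boxes)
    resample = list(range(len(kept), len(result)))
    return result, resample
```

For example, merging two adjacent boxes and passing the union to `apply_edit` yields a shorter box list plus the single index of the merged part that would be re-denoised.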
Experimental Results
Quantitative and Qualitative Evaluation
X-Part is evaluated on the ObjaversePart-Tiny dataset, using Chamfer Distance (CD) and F-Score at multiple thresholds to assess geometric quality. The method compares favorably with segmentation-based (SAMPart3D, PartField) and generative (HoloPart, OmniPart, PartCrafter, PartPacker) baselines: its CD is the lowest at the part level and is matched only by OmniPart at the object level, while its F-Score is the highest at both levels. Notably, X-Part produces more refined and semantically reasonable part decompositions, often generating a larger number of meaningful parts.
Strong numerical results:
- Part-level CD: 0.11 (vs. 0.15–0.26 for baselines)
- Part-level Fscore-0.1: 0.80 (vs. 0.59–0.73 for baselines)
- Object-level CD: 0.08 (matching OmniPart, outperforming others)
- Object-level Fscore-0.1: 0.92 (highest among all methods)
A further claim: X-Part outperforms OmniPart in decomposition quality even when OmniPart is supplied with ground-truth 2D masks.
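For readers unfamiliar with the two metrics reported above, a brute-force reference sketch on small point sets follows. Conventions vary between papers (squared versus unsquared distances, sum versus average of the two directions, point sampling density), and this summary does not state which variant X-Part uses, so the choices below are assumptions.

```python
import math

def _nearest_distances(src, dst):
    """For each point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance: sum of the two mean nearest-neighbor distances.

    Uses unsquared distances; this is one common convention, assumed here.
    """
    d_pg = _nearest_distances(pred, gt)
    d_gp = _nearest_distances(gt, pred)
    return sum(d_pg) / len(d_pg) + sum(d_gp) / len(d_gp)

def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    d_pg = _nearest_distances(pred, gt)
    d_gp = _nearest_distances(gt, pred)
    precision = sum(d <= threshold for d in d_pg) / len(d_pg)
    recall = sum(d <= threshold for d in d_gp) / len(d_gp)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The nested loops cost O(N·M) per direction, which is fine for a few thousand points; practical evaluations usually rely on a spatial index such as a k-d tree instead.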
Ablation Studies
Ablation experiments validate the contributions of each module. Removal of part embedding, object/part conditioning, semantic features, or inter-part attention degrades performance, confirming their necessity for high-fidelity decomposition and structural coherence.
Downstream Applications
X-Part facilitates part-aware UV unwrapping, mesh retopology, and interactive editing. Decomposing meshes into parts simplifies UV mapping, yielding more compact and semantically meaningful UV layouts. The bounding box-based editing pipeline enables efficient, user-driven asset manipulation.
Figure 3: Application examples of X-Part, including bounding box-controlled part generation and improved UV unwrapping via part-based decomposition.
Implementation Considerations
- Architecture: 21 DiT blocks, 512 tokens per part, 2048 tokens for object/part conditions, 50-entry part embedding codebook, MoE for output layers.
- Training: Adam optimizer, learning rate $1 \times 10^{-4}$, gradient clipping, 128 H20 GPUs, 4 days of training, semantic feature dropout (0.3), condition dropout (0.1), bounding box augmentation.
- Dataset: 2.3M objects from P3-SAM, remeshed to watertight meshes for robust training pairs.
- Scalability: Supports up to 50 parts per object; inference time increases with part count due to simultaneous processing.
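The settings listed above can be collected into a small configuration object, which also makes the scaling behavior easy to reason about. The class and field names are hypothetical, and the cost estimate assumes dense self-attention over all part tokens (quadratic in token count), which is an assumption rather than a measured figure from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class XPartConfig:
    """Hyperparameters as reported in the summary; field names are illustrative."""
    num_dit_blocks: int = 21
    tokens_per_part: int = 512
    condition_tokens: int = 2048
    part_codebook_size: int = 50
    learning_rate: float = 1e-4
    semantic_feature_dropout: float = 0.3
    condition_dropout: float = 0.1

def total_part_tokens(cfg, num_parts):
    """Number of latent tokens denoised jointly for an object with `num_parts` parts."""
    if not 1 <= num_parts <= cfg.part_codebook_size:
        raise ValueError("part count must be between 1 and the codebook size")
    return num_parts * cfg.tokens_per_part

def relative_inter_part_attention_cost(cfg, num_parts):
    """Pairwise token interactions relative to a single part, assuming dense attention."""
    return (total_part_tokens(cfg, num_parts) / cfg.tokens_per_part) ** 2
```

Under these assumptions, an object with the maximum of 50 parts involves 25,600 jointly denoised tokens, and the inter-part attention over 10 parts costs roughly 100 times that of a single part.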
Limitations and Future Directions
X-Part relies solely on geometric and semantic cues, lacking physical regularization, which may restrict its applicability in scenarios requiring physically plausible decomposition. The simultaneous diffusion of all part latents leads to increased inference time for high-part-count objects, posing challenges for real-time applications. Future work may integrate physical priors, optimize inference for scalability, and explore multimodal conditioning (e.g., text, images) for broader controllability.
Conclusion
X-Part combines bounding box prompts and point-wise semantic features within a synchronized multi-part diffusion framework for high-fidelity, structure-coherent 3D shape decomposition. On the reported benchmark it achieves state-of-the-art geometric quality while remaining controllable, supporting interactive editing and downstream 3D asset creation tasks. Its modular design and robust performance make it a practical candidate for production-oriented, editable 3D content pipelines, with room for extension to physically aware and multimodal generative models.