Part-level 3D Generation Methods
- Part-level 3D generation synthesizes 3D shapes by decomposing objects into semantically meaningful parts, enabling localized editing and enhanced controllability.
- Methods draw on sequence models, transformers, diffusion frameworks, and compositional latent spaces to achieve precise geometric and structural modeling.
- Applications span graphics, VR, modular robotics, and CAD, illustrating the approach's impact on scalable, interactive, and semantically rich 3D asset customization.
Part-level 3D generation refers to the class of methods that synthesize 3D shapes by explicitly modeling and generating their semantically meaningful parts, rather than treating objects as holistic, monolithic entities. Part-based modeling allows structures to be decomposed, manipulated, and recomposed at the component level, supporting enhanced controllability, semantic interpretability, and localized editing. The field has rapidly evolved, encompassing sequence models, transformers, diffusion models, cross-modal pipelines, and compositional latent spaces to address challenges of granularity, structure, and user control.
1. Foundational Paradigms: Sequential and Part-Disentangled Generation
Early frameworks such as PQ-NET introduced the paradigm of representing 3D shapes as sequences of parts (Wu et al., 2019). Each 3D shape is segmented into parts; each part is normalized and encoded into a compact feature vector (using a CNN-based autoencoder on SDFs), then concatenated with geometric (bounding-box) and part-index information. A bidirectional Seq2Seq RNN encoder aggregates the sequence into a fixed-size latent vector capturing both fine part geometry and inter-part structure. Decoding is performed sequentially by an RNN that predicts part geometry and placement at each step, halting with a learned stop signal. This approach enables:
- Explicit control over part geometry and structure.
- Tasks such as autoencoding, interpolation, GAN-based novel generation, and single-view reconstruction.
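The sequential decode loop described above can be sketched in a few lines. This is a toy illustration, not PQ-NET's actual architecture: random linear maps stand in for the learned geometry, bounding-box, and stop heads, and a tanh update stands in for the RNN cell; all dimensions and the `stop_thresh` value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, PART_FEAT = 16, 8  # toy dimensions

# Toy stand-ins for the learned networks (random weights, for illustration only).
W_geom = rng.normal(size=(LATENT, PART_FEAT))     # part geometry code head
W_box  = rng.normal(size=(LATENT, 6))             # bounding-box head (center + size)
W_stop = rng.normal(size=(LATENT,))               # stop-signal head
W_step = rng.normal(size=(LATENT, LATENT)) * 0.1  # RNN-like state update

def decode_parts(z, max_parts=10, min_parts=1, stop_thresh=0.5):
    """Emit (geometry code, bbox) pairs one at a time until the stop signal fires."""
    h, parts = z.copy(), []
    for t in range(max_parts):
        stop_prob = 1.0 / (1.0 + np.exp(-h @ W_stop))  # sigmoid stop signal
        if t >= min_parts and stop_prob > stop_thresh:
            break
        parts.append((h @ W_geom, h @ W_box))
        h = np.tanh(h @ W_step)  # advance the recurrent state
    return parts

z = rng.normal(size=LATENT)  # shape latent, as produced by the seq2seq encoder
parts = decode_parts(z)
print(f"decoded {len(parts)} parts")
```

The key structural point is that geometry and placement are emitted jointly per step, so downstream edits can target a single (geometry, bbox) pair without re-decoding the whole shape.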
MRGAN extends part disentanglement by introducing a multi-rooted generator where each root and associated Tree-GCN branch independently grows a semantic part without requiring part-based supervision (Gal et al., 2020). Losses (convexity, root-dropping, triplet, and reconstruction) enforce part separation and encourage semantic plausibility.
2. Hierarchical and Conditional Generation
With the recognition that many 3D objects are naturally represented via hierarchical part arrangements, subsequent works such as LSD-StructureNet (Roberts et al., 2021) factorize the generative process across levels of structural detail. Each hierarchy depth is endowed with its own probabilistic latent space and decoder, supporting conditional regeneration of sub-hierarchies without modifying higher-level structure. The encoder leverages a GCN to aggregate part features per hierarchy level, coordinated by LSTMs to capture inter-level dependencies. This design is especially well-suited to CAD and scenarios where hierarchical part modification is central.
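The conditional regeneration of sub-hierarchies can be sketched with a toy tree in which every node carries its own latent: resampling one node's latent regenerates only its subtree, leaving ancestors and siblings untouched. The class, method names, and latent dimensions below are illustrative, not LSD-StructureNet's actual interfaces.

```python
import numpy as np

rng = np.random.default_rng(4)

class PartNode:
    """Toy hierarchy node with a per-depth latent, decoded independently."""
    def __init__(self, depth, latent_dim=8, n_children=0):
        self.depth = depth
        self.latent = rng.normal(size=latent_dim)  # this level's latent space
        self.children = [PartNode(depth + 1, latent_dim) for _ in range(n_children)]

    def resample_subtree(self):
        """Regenerate this sub-hierarchy without touching the rest of the tree."""
        self.latent = rng.normal(size=self.latent.shape)
        for c in self.children:
            c.resample_subtree()

root = PartNode(0, n_children=2)
root.children[0].children = [PartNode(2), PartNode(2)]

before_root = root.latent.copy()
before_sib = root.children[1].latent.copy()
root.children[0].resample_subtree()  # edit only one branch

assert np.array_equal(root.latent, before_root)             # parent unchanged
assert np.array_equal(root.children[1].latent, before_sib)  # sibling unchanged
```

Factorizing the latent space by depth is what makes this locality possible: a single holistic latent would entangle all levels, forcing global regeneration for any edit.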
Hierarchical conditioning is further advanced in HierOctFusion (Gao et al., 14 Aug 2025), where a multi-scale octree diffusion backbone integrates semantic part features (extracted by segmentation models) into the octree generation via cross-attention. Message passing from parts to the whole at both coarse and fine scales enables detailed, structurally accurate generation, with explicit control over the semantic hierarchy.
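The part-to-whole message passing via cross-attention can be sketched as standard scaled dot-product attention in which octree node features act as queries and semantic part features act as keys and values. Shapes and dimensions below are illustrative assumptions, not HierOctFusion's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 32                    # toy feature width
n_nodes, n_parts = 64, 4  # octree nodes at one scale; semantic part tokens

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query attends over all keys."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over part tokens
    return weights @ values

octree_feats = rng.normal(size=(n_nodes, D))  # queries: nodes being denoised
part_feats   = rng.normal(size=(n_parts, D))  # keys/values: segmentation features
updated = octree_feats + cross_attention(octree_feats, part_feats, part_feats)
print(updated.shape)
```

Applying the same operation at both coarse and fine octree scales is what lets part semantics steer global layout as well as local detail.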
3. Diffusion, Compositional, and Latent-Space Factorization Approaches
Recent part-level 3D generation leverages diffusion models and explicit factorization:
- DiffFacto models part styles and part configurations as independent distributions, with cross-diffusion attention between part-latents and the generated point cloud to ensure local and global coherence (Nakayama et al., 2023). Part styles are sampled independently, then a conditional configuration model learns plausible placements.
- SALAD applies cascaded diffusion in both extrinsic (structure) and intrinsic (detail) part latent spaces (Koo et al., 2023). Parts are represented as vectors comprising spatial and orientation attributes, facilitating part completion, mixing, and text/condition-guided editing.
- PartCrafter works in a compositional latent space, encoding each part as a set of tokens augmented with identity embeddings. Hierarchical attention is then used to combine intra-part (local) and inter-part (global) information during diffusion-based mesh generation (Lin et al., 5 Jun 2025).
- CoPart explicitly models inter- and intra-part relationships using contextual part latents, mutual guidance, and a large-scale part-annotated dataset. Cross-attention blocks synchronize geometric and rendered image tokens across parts, while transformer-based global and per-part attention enforces compositional alignment (Dong et al., 11 Jul 2025).
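The factorization behind DiffFacto-style pipelines can be sketched as two sampling stages: part styles drawn independently, then a configuration model placing parts conditioned on all styles at once. A random linear map stands in for the learned conditional configuration model; all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N_PARTS, STYLE_DIM = 4, 8

# Stage 1: sample each part's style latent independently (toy Gaussian prior).
styles = rng.normal(size=(N_PARTS, STYLE_DIM))

# Stage 2: a toy "configuration model" places parts conditioned on ALL styles,
# so placements can stay mutually consistent (here: a random linear map).
W_cfg = rng.normal(size=(N_PARTS * STYLE_DIM, N_PARTS * 7)) * 0.1
config = (styles.reshape(-1) @ W_cfg).reshape(N_PARTS, 7)  # translation(3)+scale(3)+rotation(1)

for i, (s, c) in enumerate(zip(styles, config)):
    print(f"part {i}: style norm {np.linalg.norm(s):.2f}, placement {c[:3].round(2)}")
```

Because styles are sampled independently, one part can be resampled or swapped without disturbing the others; only the cheap configuration stage needs to be rerun to keep the assembly coherent.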
Autoregressive architectures, e.g., PASTA (Li et al., 18 Jul 2024), generate parts (cuboidal primitives) as a sequence with a transformer and then employ a blending network for mesh synthesis. The latent sequences encode both semantic and geometric part attributes, supporting granular editing and diverse conditioning.
4. Control, Editing, and User-Guided Generation
Part-level frameworks enable local manipulation and editing via disentangled latent representations or explicit part assignments:
- PartNeRF represents objects with locally-defined NeRFs per part, each governed by individual latent codes and affine transformations (Tertikas et al., 2023). Hard ray–part assignment ensures that edits to one part (rigid or non-rigid) do not affect others, enabling mixing, local transformation, and texture modification.
- RoMaP introduces robust 3D mask generation (3D-GALP) for Gaussian Splatting, employing spherical harmonics for view-dependent label prediction and soft-label consistency losses (Kim et al., 15 Jul 2025). A regularized SDS loss (with an L1 anchor from SLaMP editing, plus prior removal) allows precise, localized editing without global artifacts.
- OmniPart employs a two-stage pipeline: autoregressive structure planning produces bounding boxes (guided by soft 2D masks for user control over part granularity), then a flow-based denoiser (adapted from holistic 3D generators) synthesizes the parts simultaneously, resolving overlapping voxels via a discarding mechanism (Yang et al., 8 Jul 2025).
- Diverse Part Synthesis methods compare multiple generative models (MDN, cGAN, cIMLE, cDDPM) for candidate part generation, supporting interactive, diverse user-driven editing and part replacement (Guan et al., 17 Jan 2024).
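The hard ray-part assignment used by PartNeRF-style editing can be sketched as follows: every sample along a ray is assigned to the single part with the highest predicted density, and only that part's color contributes during volume rendering. The random density and color arrays below stand in for querying each part's NeRF; all shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_parts = 128, 3  # samples along one ray; number of part NeRFs

# Toy per-part density and color fields evaluated at the ray samples
# (stand-ins for querying each part's NeRF at the sample positions).
density = rng.random((n_samples, n_parts))
color = rng.random((n_samples, n_parts, 3))

# Hard assignment: each sample belongs to exactly one part (argmax density),
# so editing another part's latent cannot change this sample's contribution.
owner = density.argmax(axis=1)
sigma = density[np.arange(n_samples), owner]
rgb = color[np.arange(n_samples), owner]

# Standard volume rendering along the ray with the assigned samples.
delta = 1.0 / n_samples
alpha = 1.0 - np.exp(-sigma * delta)
trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
pixel = (trans * alpha)[:, None] * rgb
print(pixel.sum(axis=0))
```

The hard (rather than soft) assignment is the mechanism behind edit locality: a soft blend would let every part's latent leak into every pixel.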
5. Conditional and Cross-Modal Regimes
Conditioning on external modalities (text, images, or sketches) is central to expanding the usability of part-level 3D generation:
- Text and Language Conditioning: The work Segment Any 3D-Part in a Scene from a Sentence introduces the 3D-PU dataset (>800,000 part-labeled annotations) and the OpenPart3D framework, enabling scene-level fine-grained part segmentation from natural language queries (Wu et al., 24 Jun 2025). A cross-modal vision-language backbone (Florence2) provides open-vocabulary segmentation aligned with 3D superpoint grouping.
- Sketch-to-3D frameworks (PASTA) combine sketch and text inputs (through vision-language embedding fusion) and further refine latent queries with dual-graph convolutional networks (ISG-Net) (Lee et al., 17 Mar 2025). This enables precise handling of ambiguous or underspecified part properties and flexible local editing.
- PartGen repurposes multi-view diffusion for both segmentation and completion of occluded parts from text, image, or object input (Chen et al., 24 Dec 2024), enabling decomposable asset representations suitable for subsequent editing.
Some models such as DreamBeast integrate part-level knowledge distillation from 2D diffusion models (e.g., Stable Diffusion 3) into a 3D Part-Affinity implicit representation to guide composition and ensure semantic correspondence with user-specified part prompts (Li et al., 12 Sep 2024).
6. Evaluation, Datasets, and Implications
Benchmarking part-level 3D generation models utilizes metrics at both part and object levels, including:
- Intersection-over-Union (IoU), F-Score, and Chamfer Distance for geometric accuracy and part assembly (Chen et al., 17 Jul 2025).
- Fréchet Inception Distance (FID) computed on renderings, and mean Average Precision (mAP) for segmentation quality (Gao et al., 14 Aug 2025, Wu et al., 24 Jun 2025).
- Human studies and preference scores for qualitative evaluation (especially in part-level editing) (Nakayama et al., 2023).
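Of the geometric metrics above, Chamfer distance and F-score are straightforward to compute directly on point sets. The sketch below uses a brute-force pairwise distance matrix (fine for small clouds; real evaluation pipelines use k-d trees or GPU kernels), and the `tau=0.1` threshold is an illustrative assumption.

```python
import numpy as np

def chamfer_and_fscore(P, Q, tau=0.1):
    """Symmetric Chamfer distance and F-score between point sets P:(N,3), Q:(M,3)."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise distances
    d_pq, d_qp = d.min(axis=1), d.min(axis=0)  # nearest neighbour in each direction
    chamfer = d_pq.mean() + d_qp.mean()
    precision = (d_pq < tau).mean()  # fraction of P within tau of Q
    recall = (d_qp < tau).mean()     # fraction of Q within tau of P
    fscore = 2 * precision * recall / (precision + recall + 1e-9)
    return chamfer, fscore

rng = np.random.default_rng(5)
pred, gt = rng.random((256, 3)), rng.random((256, 3))
cd, f = chamfer_and_fscore(pred, gt)
print(f"Chamfer: {cd:.4f}  F-score@0.1: {f:.3f}")
```

For part-level evaluation the same computation is typically applied per part (after matching predicted parts to ground-truth parts) as well as on the assembled object.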
Key datasets in the field include ShapeNet-Seg (custom part segmentation; (Gao et al., 14 Aug 2025)), 3D-PU (Wu et al., 24 Jun 2025), Partverse (Dong et al., 11 Jul 2025), as well as curated and augmented versions of Objaverse, PartNet, and ShapeNet.
The field enables a spectrum of applications: granular asset customization in graphics and VR, modular robotics, scene compositional generation, design automation in CAD, and open-vocabulary querying for interactive editing. Architectures favoring compositional latent spaces, disentangled part-wise representations, and semantic user control are extending the granularity, efficiency, and transparency of 3D generative modeling.
7. Open Challenges and Future Trajectories
Despite progress, part-level 3D generation remains challenged by:
- The need for large-scale, diverse, high-quality part-annotated datasets to support training and generalization (prompting the synthesis of resources such as Partverse and 3D-PU).
- Semantic segmentation under occlusion or for novel categories.
- Ensuring global consistency and plausible assembly in highly articulated or complex objects.
- Adaptive handling of varying numbers and granularities of parts, including topology-altering interpolation or dynamic decomposition (e.g., dual volume packing for efficient arbitrary part counts (Tang et al., 11 Jun 2025)).
Some approaches aim to relax hard semantic segmentation through "soft" assignments or to incorporate language-driven editing and procedural workflows. The trend is toward more interactive, open-vocabulary, and compositional generation paradigms where part manipulation, guided design, and scene-level arrangement are natively supported.
This synthesis reflects the breadth and technical depth of current part-level 3D generation research, ranging from sequential assembly and hierarchical modeling to diffusion, transformer-based, and cross-modal frameworks. Each architectural advance targets increased control, expressiveness, efficiency, and downstream applicability in scientific, industrial, and creative domains.