Part-Level 3D Generation

Updated 5 March 2026

Part-level 3D generation is the synthesis of objects as assemblies of semantically meaningful parts, each with distinct geometric, spatial, and sometimes semantic attributes.
It draws on diverse methods such as implicit fields, voxel/mesh decomposition, latent token sets, and autoregressive transformers to provide detailed control over individual components.
This approach enables interactive editing, controlled sampling, and enhanced data augmentation for applications in design, CAD, robotics simulation, and real-time 3D content creation.

Part-level 3D generation refers to the computational synthesis of 3D shapes as assemblies of discrete, semantically meaningful parts—each modeled and generated individually, then composed to form coherent objects. In contrast to monolithic, holistic representations, part-level paradigms grant explicit access to geometry, arrangement, and attributes of components, enabling editing, controlled sampling, and semantic manipulation at fine granularity. This field integrates innovations in deep generative modeling, latent space design, geometric processing, user-guided control, and dataset curation, addressing fundamental challenges in both fidelity and flexibility for 3D content creation.

1. Foundations and Representations

Part-level 3D generation methods formalize objects as structured sets—graphs, sequences, or collections—of parts, each endowed with individual geometric, spatial, and sometimes semantic parameters. The representations differ across frameworks:

Implicit fields: Each part $i$ is represented by an implicit function $f_i: \mathbb{R}^3 \to [0,1]$, parameterized by a latent code $z_i$ and typically decoded by an MLP, with the overall shape produced by composing the per-part fields (e.g., by union, max, or sum) (Guan et al., 2024).
Voxel or Mesh decomposition: Parts are modeled as independent high-resolution voxel grids or mesh tokens, with each grid or mesh associated with a bounding box or affine transformation to ensure correct spatial assembly (Ding et al., 30 Oct 2025).
Latent token sets / VecSet: Transformer-based VAEs encode surface samples or parts' geometries into sets of latent vectors, sometimes augmented with per-part segmentation or semantic labels to form "Geom-Seg" sets (He et al., 10 Dec 2025).
Sequential/autoregressive: Generative models (e.g., transformers, GRUs) produce parts one-by-one, explicitly capturing dependencies and enabling variable part counts and order (Li et al., 2024, Chen et al., 17 Jul 2025).
Part-style/config factorization: Cross-diffusion networks disentangle factors such as part geometry ("style") and configuration ("pose/transform"), manipulating them independently for more controlled assembly (Nakayama et al., 2023).
Hybrid 3D-2D latent tokens: Some methods represent each part by both geometric and appearance latents (e.g., from part-centric renders and point clouds), enabling both geometry and texture control (Dong et al., 11 Jul 2025).
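
As an illustration of the implicit-field representation above, the following minimal sketch composes per-part occupancy functions by a pointwise maximum, which acts as a soft union. It is not taken from any cited method: the analytic `sphere_occupancy` function is a hypothetical stand-in for a learned decoder $f_i(\cdot\,; z_i)$, and the names `compose_union` and `sharpness` are assumptions made for this example.

```python
import numpy as np

def sphere_occupancy(points, center, radius, sharpness=20.0):
    """Soft occupancy in [0, 1] for a sphere.

    Hypothetical stand-in for a learned MLP decoder f_i(x; z_i).
    """
    dist = np.linalg.norm(points - np.asarray(center), axis=-1)
    # Sigmoid of the negated signed distance: close to 1 inside, close to 0 outside.
    return 1.0 / (1.0 + np.exp(sharpness * (dist - radius)))

def compose_union(part_fields, points):
    """Compose parts by taking the pointwise maximum of their occupancies.

    Returns the composed occupancy and the index of the part that attains it.
    """
    values = np.stack([f(points) for f in part_fields], axis=0)  # (num_parts, num_points)
    return values.max(axis=0), values.argmax(axis=0)

# Two toy parts; a real model would decode these from latent codes z_i.
parts = [
    lambda p: sphere_occupancy(p, center=(0.0, 0.0, 0.0), radius=0.5),
    lambda p: sphere_occupancy(p, center=(1.0, 0.0, 0.0), radius=0.3),
]
queries = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 3.0, 3.0]])
occupancy, part_index = compose_union(parts, queries)
```

Because the argmax is returned alongside the composed field, every query point inside the shape can be attributed to a part, which is what makes per-part editing possible in this kind of representation.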

2. Generative Architectures and Algorithms

Recent research demonstrates diverse generative schemes tailored to part-level structure:

Conditional Diffusion and Flow Models: Denoising diffusion probabilistic models (DDPMs) operate on latent representations of parts, voxelized fields, or mesh tokens. Modern frameworks use conditional diffusion that attends to both part-level and global context, trained with rectified flow or conditional flow matching objectives (Ding et al., 30 Oct 2025, He et al., 10 Dec 2025).
- Discrete token diffusion (for mesh generation) runs in semi-autoregressive mode: autoregression controls the inter-part sequence, while intra-part geometry is generated in parallel to capture high-frequency detail (Yang et al., 24 Nov 2025).
Autoregressive Transformers: Sequential models, particularly autoregressive transformers, generate parts as variable-length sequences of primitives (e.g., cuboids) or latent tokens, conditioned on previous parts and global context, achieving explicit control over part count, ordering, and structural diversity (Li et al., 2024, Chen et al., 17 Jul 2025).
Cross-attention & Message Passing: Hierarchical or multi-scale architectures inject part-derived features into main geometry representations at multiple levels, often via cross-attention, facilitating information flow between parts and global context (e.g., octree diffusion with hierarchical part conditioning (Gao et al., 14 Aug 2025)).
Retrieval-Augmented Generation: Recent frameworks incorporate retrieval modules that align image patches and 3D part latents, drawing from a curated part asset bank to infuse rare or physically plausible exemplars into the diffusion trajectory; this enhances diversity and fidelity on rare classes (Li et al., 19 Feb 2026).
User-in-the-loop and Editing: Many pipelines support interactive workflows: at each generation step, users can select, modify, or resample an individual part, influencing subsequent synthesis through conditional latent trajectories (Guan et al., 2024, Li et al., 19 Feb 2026).
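
The rectified flow and conditional flow matching objectives mentioned above can be sketched in their common textbook form: a straight-line interpolant between a noise sample and a data sample, with the constant velocity along that line as the regression target. The sketch below assumes the noise-to-data convention with $t \in [0, 1]$ (some papers use the reverse direction). The tensor shapes, the `condition` embedding, and `toy_velocity_model` are hypothetical placeholders, not the architecture of any cited work.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Return the linear interpolant x_t = (1 - t) * x0 + t * x1 and its velocity x1 - x0."""
    t = t.reshape(-1, 1, 1)  # broadcast one time value per batch element
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

def flow_matching_loss(velocity_model, x0, x1, t, condition):
    """Mean squared error between the predicted and the target velocity."""
    x_t, target = flow_matching_pair(x0, x1, t)
    pred = velocity_model(x_t, t, condition)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
batch, num_tokens, dim = 4, 8, 16
part_latents = rng.normal(size=(batch, num_tokens, dim))  # x1: hypothetical per-part latent tokens
noise = rng.normal(size=(batch, num_tokens, dim))         # x0: Gaussian noise
t = rng.uniform(size=batch)
condition = rng.normal(size=(batch, dim))                  # hypothetical global/context embedding

def toy_velocity_model(x_t, t, condition):
    """Placeholder predictor that ignores its inputs; a real model would attend to part and global context."""
    return np.zeros_like(x_t)

loss = flow_matching_loss(toy_velocity_model, noise, part_latents, t, condition)
```

In an actual training loop the placeholder would be replaced by a network with learnable parameters and the loss minimized by gradient descent; the NumPy version only makes the target construction explicit.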

3. Control, Diversity, and User Guidance

Part-level generation enables control and editing modalities that holistic approaches do not readily support:

Diverse Suggestion and Regeneration: Multimodal conditional models (e.g., cIMLE, MDNs, cDDPMs) generate multiple diverse candidates for each part, from which users can select or further condition the assembly (Guan et al., 2024). Among these, cIMLE is reported to achieve the highest diversity and coverage on quantitative metrics.
Explicit Conditioning: Many models permit explicit constraints via mask inputs, bounding box layouts, or text/image prompts mapped to part tokens, enabling targeted manipulation and direct part-by-part design (Yang et al., 8 Jul 2025, Dong et al., 11 Jul 2025).
Part-aware Editing: Once objects are part-decomposed, local edits—such as adding, deleting, or transforming specific parts—are performed via latent modification, cross-attention diffusion updates, or masked diffusion steps, preserving non-target parts (He et al., 10 Dec 2025, Li et al., 19 Feb 2026).
Semantic and Topological Consistency: Cross-part attention or dual-space latent prediction (i.e., global and canonical frames) is leveraged to ensure that part edits, resampling, or swaps maintain semantic compatibility and produce watertight, realistic assemblies (He et al., 10 Dec 2025, Ding et al., 30 Oct 2025).
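
The idea behind masked, part-aware editing can be sketched as follows: a proposed update (for example, one denoiser step over all latent tokens) is accepted only on tokens that belong to the target part, so the remaining parts are preserved exactly. This is a simplified illustration under assumed data structures. The token-to-part assignment `part_ids`, the function `masked_edit_step`, and the random `proposal` are hypothetical and do not reproduce a specific cited pipeline.

```python
import numpy as np

def masked_edit_step(latents, proposal, part_ids, target_part):
    """Accept the proposal only on tokens of the target part; keep all other tokens unchanged."""
    mask = (part_ids == target_part)[:, None]  # (num_tokens, 1), broadcast over channels
    return np.where(mask, proposal, latents)

rng = np.random.default_rng(0)
num_tokens, dim = 6, 4
latents = rng.normal(size=(num_tokens, dim))   # current latent tokens of the object
part_ids = np.array([0, 0, 1, 1, 2, 2])        # hypothetical token-to-part assignment
proposal = rng.normal(size=(num_tokens, dim))  # stands in for one denoiser update of all tokens

# Edit part 1 only; parts 0 and 2 are left untouched.
edited = masked_edit_step(latents, proposal, part_ids, target_part=1)
```

In an iterative sampler the same mask would be applied after every step, which is one simple way to keep non-target parts fixed throughout the edit.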

4. Dataset Construction and Evaluation

Progress in part-level 3D generation is tightly coupled to advances in data curation and benchmarking:

Large-Scale, Part-Annotated Corpora: Datasets such as PartVerse-XL (~40K objects, 320K parts), HY3D-Bench (240K part-decomposed models), and PartVerse (curated from Objaverse) provide high-quality, semantically labeled parts suitable for supervision and evaluation (Ding et al., 30 Oct 2025, Hunyuan3D et al., 3 Feb 2026, Dong et al., 11 Jul 2025).
Annotation Protocols: Pipelines combine automated segmentation (connectivity, UV cues, deep models) with human verification and cleaning to ensure each part forms a coherent, functional or semantic unit; per-part captions are generated via VLMs on rendered crops (Ding et al., 30 Oct 2025).
Evaluation Metrics: Common metrics include Chamfer Distance (CD), F-score at various thresholds, Earth Mover's Distance (EMD), Coverage (COV), Minimum Matching Distance (MMD), pairwise Euclidean distance (ED), and part IoU. These are measured both at the part and whole-object levels, often for both geometry and semantic fidelity (Guan et al., 2024, Ding et al., 30 Oct 2025, He et al., 10 Dec 2025, Yang et al., 24 Nov 2025).
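
Two of the metrics listed above, Chamfer Distance and F-score, can be sketched on point clouds as shown below. Conventions differ between papers (squared versus unsquared distances, sum versus average of the two directions, choice of threshold), so this should be read as one common variant rather than the definition used by any particular benchmark. The brute-force pairwise distance matrix is an assumption suited to small point sets only.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between every point in a (N, 3) and every point in b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)  # (N, M)

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance: mean nearest-neighbor distance in both directions, summed."""
    d = pairwise_distances(a, b)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    d = pairwise_distances(pred, gt)
    precision = float((d.min(axis=1) < threshold).mean())  # predicted points near the ground truth
    recall = float((d.min(axis=0) < threshold).mean())     # ground-truth points near the prediction
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Toy example: the prediction is the ground truth shifted by 0.1 along y.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred = gt + np.array([0.0, 0.1, 0.0])

cd = chamfer_distance(pred, gt)           # approximately 0.2
f_loose = f_score(pred, gt, threshold=0.2)
f_tight = f_score(pred, gt, threshold=0.05)
```

Applying the same functions to the points of a single part rather than the whole object gives the part-level variants of these metrics.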
Ablation and Comparative Analysis: Benchmarks compare against monolithic baselines (e.g., Trellis, HoloPart, TripoSG), hierarchical or hybrid methods (e.g., OmniPart, PartCrafter), and autoregressive or retrieval-augmented models, with ablation studies quantifying the benefit of part-aware modules or loss terms (Ding et al., 30 Oct 2025, Yang et al., 8 Jul 2025, Li et al., 19 Feb 2026).

5. Applications and Practical Impact

Adoption of part-level 3D generation techniques has catalyzed new capabilities across several domains:

Design and Editing Tools: These models permit interactive 3D asset creation, localized editing, and rapid generation of part variations—functionality relevant for digital content creation, CAD, and game asset workflows (Guan et al., 2024, Li et al., 19 Feb 2026).
Semantic Manipulation: Text-driven or example-guided part editing allows semantic alterations, such as resizing structural components or altering stylistic attributes at the part level, guided by CLIP or VLM features (He et al., 10 Dec 2025, Ding et al., 30 Oct 2025).
Scene and Object Composition: Hierarchical and compositional frameworks facilitate not only single-object assembly but scene-level grouping, treating objects as scene parts to enable multi-object synthesis (Dong et al., 11 Jul 2025).
Data Augmentation: Part-aware generative priors can provide controllable asset augmentations for downstream 3D understanding, robotics simulation, and segmentation training (Nakayama et al., 2023, Hunyuan3D et al., 3 Feb 2026).
Fine-Grained Style Control: In artistic or specialized domains (e.g., 3D font design), dynamic part assignment and per-component optimization are leveraged to control style injection and geometric regularity (Gan et al., 29 Nov 2025).

6. Limitations and Ongoing Research Directions

Despite substantial advances, notable challenges remain:

Scalability and Efficiency: Per-part high-resolution grids or deep diffusion passes increase computational demands. Sparse or multi-resolution representations, as well as more efficient autoregressive or masking strategies, remain active research areas (Ding et al., 30 Oct 2025, Yang et al., 24 Nov 2025).
Part Granularity and Semantic Control: Automatic part decomposition depends heavily on dataset segmentation quality and limits explicit granularity control for users; more flexible ontologies and user-driven decompositions are needed (Tang et al., 11 Jun 2025, Li et al., 19 Feb 2026).
Beyond Axis-Aligned or Primitive Parts: Current frameworks often rely on axis-aligned bounding boxes or voxel grids, limiting expressivity for curved or complex parts; extending to superquadrics, swept volumes, or direct mesh representations is a key open direction (Ding et al., 30 Oct 2025).
Articulated and Dynamic Parts: Handling articulated objects, hierarchical assemblies, or deformable structures—where part connectivity and movement are complex—requires richer priors and new generative mechanisms (Dong et al., 11 Jul 2025, Yang et al., 8 Jul 2025).
Interactive and Real-Time Generation: Real-time editing and generation demand further optimization, including model distillation, latent caching, and acceleration of diffusion-based synthesis (Ding et al., 30 Oct 2025, Li et al., 19 Feb 2026).

Collectively, part-level 3D generation now offers a foundation for compositional, controllable, and semantically structured shape modeling, supported by advances in both algorithms and datasets. Ongoing research aims to extend it to more complex design tasks and interactive media (Guan et al., 2024, Ding et al., 30 Oct 2025, He et al., 10 Dec 2025, Chen et al., 2024, Yang et al., 24 Nov 2025, Chen et al., 17 Jul 2025, Li et al., 19 Feb 2026).