Image-Based Part-Level Generation
- Image-based part-level generation methods decompose visual data into discrete, semantically meaningful parts, enabling granular control over synthesis.
- They enable applications such as controllable 2D/3D editing, animation, and robotic manipulation by allowing modular manipulation and detailed reconstruction.
- Innovations like diffusion models and compositional latent spaces improve fidelity, enable part mixing, and support cross-modal reasoning for diverse real-world tasks.
Image-based part-level generation methods are techniques that synthesize, manipulate, or reason about images—or image-to-3D/4D mappings—by decomposing visual data into discrete, semantically meaningful parts and generating or composing outputs at part granularity. Such methods address the limitations of holistic, monolithic synthesis by enabling structural control, fine-grained editing, and more interpretable or functionally useful outputs for applications spanning 2D image editing, controllable generative modeling, 3D/4D shape and animation synthesis, robotics, design, and scene understanding.
1. Foundations: Part-Level Decomposition and Generation
The core of part-level generative methods is the explicit decomposition of imagery into atomic “parts”—subcomponents characterized geometrically (e.g., mesh parts, shape segments), semantically (e.g., object regions, garment elements), or functionally (e.g., articulated links, tool heads). Pioneering research such as Composite Generative Adversarial Networks (CGAN) introduced architectures with multiple generators, each responsible for synthesizing one part of an image, with the parts composited via alpha blending (1607.05387). These unsupervised, multi-generator designs demonstrated empirically that deep networks can self-organize to structure visual scenes partwise without explicit annotation, using alpha regularization to encourage each generator’s involvement.
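As a concrete illustration, below is a minimal PyTorch sketch of the alpha-blended compositing step, assuming K generators each emit an RGB image and an alpha mask; the tensor shapes and the uniform-coverage regularizer are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def composite_parts(rgbs, alphas, background, reg_weight=0.1):
    """Alpha-blend K part-generator outputs over a background.

    rgbs:       list of K tensors, each (B, 3, H, W) in [0, 1]
    alphas:     list of K tensors, each (B, 1, H, W) in [0, 1]
    background: (B, 3, H, W)
    Returns the composited image and an illustrative alpha regularizer
    that discourages any single generator from claiming the whole canvas.
    """
    image = background
    for rgb, alpha in zip(rgbs, alphas):
        # Standard "over" compositing: later parts are painted on top.
        image = alpha * rgb + (1.0 - alpha) * image
    # Nudge every generator toward a roughly equal share of coverage
    # (assumed stand-in for the alpha regularization described above).
    coverages = torch.stack([a.mean() for a in alphas])
    alpha_reg = reg_weight * ((coverages - 1.0 / len(alphas)) ** 2).sum()
    return image, alpha_reg
```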
Subsequent methods built on the intuition that breaking visual data into composable parts—not only at the pixel level, but also in higher-dimensional embeddings (e.g., latent codes, 3D primitives)—supports controllability, modularity, and enables downstream manipulation and understanding.
2. Structured Latent Modeling and Part-Aware Diffusion
Diffusion-based methods have established a strong foothold in part-level generation, especially for 3D objects. SALAD (2303.12236) introduced a cascaded diffusion model operating on part-level implicit representations, where Gaussian primitives for each semantic part are sequentially generated: low-dimensional “extrinsic” vectors encode pose, scale, and orientation, followed by high-dimensional “intrinsic” attributes controlling fine geometry. This approach enables zero-shot completion, part mixing, and localized editing by virtue of disentangled latent spaces. Performance benchmarks demonstrate superior coverage and fidelity compared to holistic diffusion models, and highlight support for text- and mask-guided part completion.
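The cascaded, part-level sampling procedure can be sketched as follows; the `extrinsic_denoiser`, `intrinsic_denoiser`, and `scheduler` objects are placeholders with an assumed interface, and the dimensions are illustrative rather than SALAD's actual configuration.

```python
import torch

@torch.no_grad()
def sample_parts(extrinsic_denoiser, intrinsic_denoiser, scheduler,
                 num_parts, ext_dim=16, int_dim=512, device="cpu"):
    """Two-stage part sampling: coarse per-part extrinsics first, then
    fine intrinsics conditioned on them. All interfaces are assumed."""
    # Stage 1: per-part pose/scale/orientation codes.
    ext = torch.randn(num_parts, ext_dim, device=device)
    for t in scheduler.timesteps:
        noise_pred = extrinsic_denoiser(ext, t)
        ext = scheduler.step(noise_pred, t, ext)   # assumed denoising update

    # Stage 2: per-part fine-geometry codes, conditioned on frozen extrinsics.
    intr = torch.randn(num_parts, int_dim, device=device)
    for t in scheduler.timesteps:
        noise_pred = intrinsic_denoiser(intr, t, cond=ext)
        intr = scheduler.step(noise_pred, t, intr)
    return ext, intr
```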
PartGen (2412.18608) extended these ideas using multi-view diffusion for both part segmentation (via stochastic “coloring” of multi-view renders) and part completion, enabling not only the decomposition of monolithic assets but also hallucination of occluded or missing geometry. This dual-model design ensures that completed parts remain cohesive when recombined, facilitating seamless downstream editing and 3D assembly.
PartCrafter (2506.05573) utilizes a compositional latent space wherein each 3D part is encoded by a distinct set of latent tokens, and introduces a hierarchical local-global attention mechanism (local within-part, global across-parts) to balance individual detail and overall coherence. This design enables direct, simultaneous generation of arbitrarily many decomposable 3D meshes from a single image—without pre-segmentation.
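One way such a local/global scheme can be realized is with a block-diagonal attention mask: local layers restrict attention to each part's own tokens, while global layers attend over all tokens. The sketch below illustrates the masking idea (token counts are illustrative; this is not PartCrafter's exact implementation).

```python
import torch

def local_attention_mask(part_ids):
    """Boolean mask (True = may attend) restricting attention to tokens
    belonging to the same part. part_ids: (N,) long tensor mapping each
    latent token to its part index. Global layers use an all-True mask."""
    return part_ids.unsqueeze(0) == part_ids.unsqueeze(1)

# Example: 3 parts with 4 latent tokens each -> a 12x12 block-diagonal mask,
# usable as attn_mask in torch.nn.functional.scaled_dot_product_attention.
part_ids = torch.arange(3).repeat_interleave(4)
mask = local_attention_mask(part_ids)
```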
Dual volume packing (2506.09980) introduces a graph-theoretic approach: parts in contact are assigned to complementary packed volumes via bipartite partitioning, so that SDF-based mesh extraction produces disjoint, complete part geometries at a generation cost that stays constant regardless of part count or topology.
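Assuming the contact graph is bipartite, the volume assignment can be sketched as a BFS 2-coloring; the function below is an illustrative reconstruction, not the paper's exact algorithm.

```python
from collections import deque

def pack_into_two_volumes(num_parts, contacts):
    """Assign each part to volume 0 or 1 so touching parts land in different
    volumes, via BFS 2-coloring. `contacts` is a list of (i, j) index pairs
    of parts in contact. Returns None when the contact graph contains an odd
    cycle (not bipartite), which the full method must resolve separately."""
    adj = [[] for _ in range(num_parts)]
    for i, j in contacts:
        adj[i].append(j)
        adj[j].append(i)
    volume = [None] * num_parts
    for start in range(num_parts):
        if volume[start] is not None:
            continue
        volume[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if volume[v] is None:
                    volume[v] = 1 - volume[u]
                    queue.append(v)
                elif volume[v] == volume[u]:
                    return None  # odd cycle: strict bipartition impossible
    return volume
```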
3. Compositional Generation, Control, and Part Mixing
Part-based generation is leveraged for explicit compositional control. StylePart (2111.10520) connects the latent space of image generative models (e.g., StyleGAN) with part-aware 3D attribute spaces, enabling manipulation tasks such as part replacement, resizing, and view synthesis by forward- and backward-mapping latent codes between image and shape domains.
PartComposer (2506.03004) and PiT (Piece it Together) (2503.10365) support one-shot part-level learning from annotated images, with mutual information maximization (PartComposer) ensuring each part’s code is clearly identifiable and composable, and flow-based priors (PiT) filling in missing parts conditioned on user-selected component embeddings. Both methods demonstrate superior performance in part identity preservation, diversity, and controllable editing, with PartComposer specifically achieving disentanglement and scalable mixing even with limited data.
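One generic way to encourage such identifiability is a contrastive (InfoNCE-style) lower bound on the mutual information between each part code and the feature of the region it generates; the sketch below illustrates that idea only and is not PartComposer's exact objective.

```python
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(part_codes, region_features, temperature=0.07):
    """Contrastive bound on MI between each part's code (K, D) and the
    feature of its generated region (K, D); matching pairs share an index.
    Maximizing the returned value (i.e., minimizing the cross-entropy)
    tightens an InfoNCE lower bound on the mutual information."""
    codes = F.normalize(part_codes, dim=-1)
    regions = F.normalize(region_features, dim=-1)
    logits = codes @ regions.t() / temperature          # (K, K) similarities
    targets = torch.arange(codes.shape[0], device=codes.device)
    return -F.cross_entropy(logits, targets)
```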
Recent fine-grained diffusion pipelines (e.g., PartStickers (2504.05508)) specialize in generating isolated parts on neutral backgrounds for rapid prototyping: images and part masks are combined into “sticker” data, and LoRA-adapted diffusion preserves both part specialization and object-level synthesis capabilities.
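A minimal sketch of assembling such "sticker"-style training samples by pasting a masked part onto a neutral background; the grayscale value and preprocessing are assumptions, and the paper's actual pipeline may differ.

```python
import numpy as np
from PIL import Image

def make_part_sticker(image_path, mask_path, background_gray=200):
    """Cut the masked part out of an image and paste it onto a uniform
    neutral background, yielding a 'sticker'-style training sample."""
    image = np.array(Image.open(image_path).convert("RGB"), dtype=np.uint8)
    mask = np.array(Image.open(mask_path).convert("L"), dtype=np.uint8) > 127
    sticker = np.full_like(image, background_gray)   # neutral gray canvas
    sticker[mask] = image[mask]                      # copy only the part pixels
    return Image.fromarray(sticker)
```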
InstanceGen (2505.05678) introduces a two-stage system that infers instance-level object assignments, attributes, and spatial layout from initial images using cross-attention and segmentation, then refines output via attention-masked diffusion guided by LLM-based per-instance instructions, ensuring strict and interpretable prompt adherence even for complex, multi-instance prompts.
4. Part-Level Reasoning, Query, and Cross-Modal Alignment
Query-based and tokenized Transformer models support part-level prediction and synthesis for various tasks (a minimal part-query decoder sketch follows the list):
- QueryPose (2212.07855) demonstrates direct sparse pose regression by deploying a system of part-level queries refined by local spatial embeddings, achieving high accuracy and efficiency for multi-person pose estimation, with significant robustness and minimal post-processing.
- ARMANI (2208.05621) introduces part-level garment–text alignment for cross-modal fashion generation, employing MaskCLIP to align semantic garment segments with textual concepts at the token level. A cross-modal discrete codebook enables image synthesis from text, sketch, and partial images, showing clear advantages over prior VQ-GAN-based methods in producing and editing fine-grained garment layouts.
- OMG-LLaVA (2406.19389) integrates a universal segmentation backbone with a large multimodal LLM, using perception-prior tokenization to enable instruction-guided, pixel/part-level segmentation and reasoning. The framework supports arbitrarily complex vision-language interactions and reasoning-driven part extraction, with strong benchmark results on referring expression and grounded conversation tasks.
- Parts2Whole (2404.15267) achieves controllable human image generation by encoding part-specific reference images into dense, label-guided features, and fusing them into the diffusion process with shared, mask-guided self-attention, allowing arbitrary selection, swapping, and spatially-aware mixing of visual components.
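The sketch below illustrates the shared pattern behind these query-based designs: learned per-part queries cross-attend to image features in a Transformer decoder and are read out as per-part predictions. It is a generic illustration (the keypoint head and hyperparameters are assumptions), not any single paper's architecture.

```python
import torch
import torch.nn as nn

class PartQueryHead(nn.Module):
    """Learned per-part queries cross-attend to image features and are
    decoded into per-part predictions (here, 2D keypoint coordinates)."""

    def __init__(self, num_parts=17, dim=256, num_layers=3):
        super().__init__()
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.to_coords = nn.Linear(dim, 2)  # (x, y) per part

    def forward(self, image_tokens):
        # image_tokens: (B, N, dim) flattened backbone features.
        B = image_tokens.shape[0]
        queries = self.part_queries.unsqueeze(0).expand(B, -1, -1)
        refined = self.decoder(queries, image_tokens)   # (B, num_parts, dim)
        return self.to_coords(refined).sigmoid()        # normalized coordinates
```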
5. Practical Applications Across Domains
Image-based part-level generation methods are deployed across a variety of real-world scenarios:
- Interactive design and prototyping: BOgen (2312.02557) applies Bayesian optimization and VAE dimensionality reduction to map the high-dimensional part design space to an interactive 2D map, with user intention coupling and GP-UCB sampling for guided exploration (a GP-UCB sketch follows this list); PartStickers (2504.05508) facilitates modular prototyping for design and gaming assets.
- 3D/4D graphics and animation: SALAD (2303.12236), PartGen (2412.18608), PartCrafter (2506.05573), and Puppet-Master (2408.04631) demonstrate part-aware synthesis for creative, animation, or robotic simulation; Puppet-Master especially supports interactive, drag-based part motion synthesis generalized from synthetic to real data.
- Robotics and manipulation: Part-level shape and kinematic estimation (2504.03177, 2410.16499) from single RGB or RGBD images enables object-centric autonomous planning and manipulation, with improved generalization due to scalable part-representation and kinematics-aware grouping.
- Fashion, product editing, and content creation: ARMANI (2208.05621) and Parts2Whole (2404.15267) support garment and appearance customization on a fine-grained, multi-modal basis.
- Biomedical/industrial diagnosis: Domain-specific segmentation and part-aware augmentation aid disease detection in aquaculture settings (2407.11348).
- Compositional image synthesis and fine-grained editing: InstanceGen (2505.05678) achieves prompt-level semantic control across multiple objects and relations, while PartComposer (2506.03004) and PiT (2503.10365) enable rich ideation, style transfer, and modular concept development.
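For the design-exploration setting above, the GP-UCB acquisition step can be sketched as follows. This is a generic sketch in which the 2D coordinates stand in for the VAE-reduced design map and the scores for user feedback; it is not BOgen's implementation, and the kernel and beta values are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_ucb_next_point(observed_xy, observed_scores, candidates, beta=2.0):
    """Pick the next 2D design-map point to evaluate by maximizing the
    GP-UCB acquisition (posterior mean + beta * std) over a candidate grid."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), normalize_y=True)
    gp.fit(observed_xy, observed_scores)
    mean, std = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(mean + beta * std)]

# Example: explore a [0, 1]^2 latent map from three user-rated designs.
observed_xy = np.array([[0.2, 0.3], [0.7, 0.6], [0.5, 0.9]])
observed_scores = np.array([0.4, 0.8, 0.6])
grid = np.stack(np.meshgrid(np.linspace(0, 1, 50),
                            np.linspace(0, 1, 50)), -1).reshape(-1, 2)
print(gp_ucb_next_point(observed_xy, observed_scores, grid))
```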
6. Evaluation, Limitations, and Emerging Research
Empirical results across these systems demonstrate quantitative advances in part preservation, image/shape fidelity (e.g., FID, CLIPScore, mAP, IoU, F-Score), controllability, and sample efficiency. Many studies report strong performance on large benchmarks (e.g., COCO, ShapeNet, proprietary part-annotated datasets), alongside user studies confirming utility for design workflows.
Limitations include:
- Semantic unpredictability: Unsupervised or under-constrained role assignment may yield ambiguous or drifting generator responsibilities (1607.05387).
- Occlusion ambiguity and reconstruction: Amodal completion of occluded or missing geometry remains challenging (addressed via contextual completion models (2412.18608)).
- Scalability and representation: Increasing part numbers, inter-part contact, or complex kinematic graphs challenge both model and data curation (addressed in dual volume packing (2506.09980) and hierarchical graph-based methods (2410.16499)).
- Part definition ambiguity: Artifacts may arise from under- or over-segmentation, especially in cross-category or heavily stylized domains.
- Combinatorial explosion: For models relying on explicit training over permutations (e.g., PiT), handling rare part combinations or unseen categories may be limiting without further advances in foundation model generalization.
Emerging research directions include scene-level structured generation, iterative user-guided part synthesis, universal part tokenization for instruction-following MLLMs, and robust handling of variable-granularity decompositions.
7. Summary Table: Representative Approaches
| Method / Paper | Domain | Part Representation | Compositionality | Key Innovation |
|---|---|---|---|---|
| CGAN (1607.05387) | 2D Image | Generators + alpha blending | Yes (unsupervised) | Decoupled generators, alpha loss |
| StylePart (2111.10520) | Image + 3D | Image↔3D part latent codes | Yes | Shape-consistent mappings |
| SALAD (2303.12236) | 3D Shape | Gaussian primitives per part | Yes (cascaded) | Diffusion in extrinsic/intrinsic latent space |
| PartGen (2412.18608) | 3D Shape | Multi-view segmentations | Yes | Multi-view diffusion for segmentation + completion |
| PartCrafter (2506.05573) | 3D Mesh | Disentangled tokens per part | Yes | Hierarchical attention (local/global) |
| PartComposer (2506.03004) | 2D Image | Single-image, learned tokens | Yes (one-shot) | Mutual information maximization for disentanglement |
| ARMANI (2208.05621) | Cross-modal | Cross-modal tokens | Yes | MaskCLIP, cross-modal Transformer |
| InstanceGen (2505.05678) | 2D to 2D | Instance masks/attention | Yes | LLM-guided assignment, attention-masked loss |
By leveraging semantic part decomposition, structured latent models, and increasingly powerful generative and reasoning architectures, image-based part-level generation methods now underpin a wide range of controllable, interoperable, and structurally aware applications across vision, graphics, robotics, and design.