Part-Specific Multi-Mesh Generation

Updated 31 May 2026

Part-specific multi-mesh generation is a paradigm that produces 3D objects as distinct, semantic parts, each with its own coherent mesh for improved editing and simulation.
Recent methods combine autoregressive, parallel, and hybrid techniques to ensure structural coherence and high geometric fidelity in part decomposition.
Standardized metrics like Chamfer Distance and part IoU validate these techniques, driving advancements in 3D asset pipelines for animation and physical simulation.

Part-specific multi-mesh generation refers to the class of methods and frameworks aimed at producing 3D objects explicitly as collections of semantically distinct parts, with each part represented as its own coherent mesh or field. This paradigm allows for downstream applications such as structured editing, physical simulation, animation, and part-level manipulation, which are infeasible or brittle with monolithic mesh representations. The following sections survey the core principles, algorithmic architectures, mathematical formulations, and current empirical boundaries of the field, focusing on recent advances up to 2026.

1. Foundations and Problem Motivation

Traditional 3D generative models, whether based on implicit fields, global diffusion models, or holistic mesh autoencoders, predominantly output a single, fused mesh or field devoid of part structure. This approach precludes direct editing or articulation of subcomponents and complicates semantic downstream tasks. The need for structured multi-mesh outputs is motivated by requirements from 3D content pipelines (asset libraries, games, simulation), where semantic parts underpin animation rigging, material assignment, and behavioral scripting. Multi-part generation must satisfy:

Semantic decomposition: Each part must correspond to a meaningful object component (e.g., “chair back,” “airplane wing”).
Structural coherence: Parts must assemble seamlessly, respecting physical and geometric constraints (e.g., no gaps or overlaps at joints).
Controllability: Ability to select part identities or counts, ideally with open-vocabulary or user-driven granularity (Zhu et al., 27 May 2026).
Geometric fidelity: Each part should exhibit high accuracy in both global and local geometric features.

This creates a tension between enforcing global topology and preserving fine local detail: autoregressive models tend toward globally plausible but overly smoothed parts, whereas parallel models may capture detail but drift structurally. Recent work addresses this trade-off by hybridizing autoregressive sequencing, per-part parallelism, and compositional latents (Yang et al., 24 Nov 2025, Lin et al., 5 Jun 2025, Ding et al., 30 Oct 2025).

2. Architectural Paradigms

2.1. Semi-Autoregressive and Hierarchical Models

PartDiffuser (Yang et al., 24 Nov 2025) introduces a hybrid, semi-autoregressive diffusion protocol: global topology is enforced by generating parts in autoregressive order (determined by BFS over the part-adjacency graph), while local part geometry is recovered in parallel for each part using masked discrete diffusion. The backbone is a DiT variant with a composite attention mask that ensures intra-block (intra-part) bidirectional attention and inter-block (inter-part) strict causality. Part-aware cross-attention incorporates both global and part-specific context vectors.
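The composite mask described above can be illustrated with a small sketch. This is not PartDiffuser's implementation; it only shows, under the assumption that tokens are laid out part by part in generation order, how a boolean mask can allow bidirectional attention within a part and causal attention across parts. The function name `block_causal_mask` and the part sizes are hypothetical.

```python
from typing import List


def block_causal_mask(part_sizes: List[int]) -> List[List[bool]]:
    """Return mask[q][k] = True where query token q may attend to key token k.

    Assumption: tokens are stored part by part, in the order parts are generated.
    """
    # Label every token with the index of the part (block) it belongs to.
    block_of = [b for b, size in enumerate(part_sizes) for _ in range(size)]
    n = len(block_of)
    # A token sees all tokens of its own part (bidirectional inside the block)
    # and all tokens of earlier parts (causal across blocks), never later parts.
    return [[block_of[k] <= block_of[q] for k in range(n)] for q in range(n)]


# Two parts with 2 and 3 tokens; "x" marks an allowed query/key pair.
mask = block_causal_mask([2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In a transformer, such a mask would typically be applied to the attention logits before the softmax; the sketch stops at building the mask itself.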

Hierarchical and compositional approaches—exemplified by PartCrafter (Lin et al., 5 Jun 2025)—organize the latent space into disjoint part-specific slots, each processed with local attention, while periodic global attention layers enforce coherence. This compositional transformer model enables simultaneous denoising of all parts, integrating within-part and cross-part information flows.
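As a rough illustration of alternating local and global attention over part slots, the sketch below lists which tokens attend to one another in a given layer. The schedule (`global_every`) and the function name are assumptions made for the example; the source only states that global attention layers occur periodically.

```python
from typing import List


def attention_groups(num_parts: int, tokens_per_part: int,
                     layer_idx: int, global_every: int = 2) -> List[List[int]]:
    """Return groups of token indices that attend to each other in this layer.

    Hypothetical schedule: every `global_every`-th layer is global,
    all other layers are local to a single part slot.
    """
    all_tokens = list(range(num_parts * tokens_per_part))
    if (layer_idx + 1) % global_every == 0:
        # Global layer: one group spanning the tokens of every part.
        return [all_tokens]
    # Local layer: one group per part slot, with no cross-part attention.
    return [all_tokens[p * tokens_per_part:(p + 1) * tokens_per_part]
            for p in range(num_parts)]


for layer in range(4):
    print(layer, attention_groups(num_parts=2, tokens_per_part=3, layer_idx=layer))
```

Because every part is present in every layer, all parts can be denoised at the same time; only the scope of attention changes from layer to layer.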

2.2. Hybrid Implicit/Explicit Pipelines

FullPart (Ding et al., 30 Oct 2025) advances the generation of high-resolution details by combining implicit layout diffusion (for bounding box prediction and rough arrangement) with explicit voxel-based diffusion over canonical, per-part grids. Each part is generated inside its own $64^3$ voxel grid, mapped globally with a “center-corner” encoding scheme, and refined with mesh VAEs. This avoids the voxel-budget dilution of shared global grids and sharply improves small-part fidelity.
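The voxel-budget argument can be made concrete with a short sketch. It assumes each part has an axis-aligned bounding box and a cubic grid of side 64, and it maps voxel indices to world coordinates by plain linear scaling. It does not reproduce FullPart's center-corner encoding, whose details are not given here; both function names are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def voxel_center_to_world(index: Sequence[int], box_min: Vec3, box_max: Vec3,
                          resolution: int = 64) -> Vec3:
    """Map an integer voxel index (i, j, k) of a per-part grid to the world
    coordinates of that voxel's center, given the part's bounding box."""
    return tuple(lo + (i + 0.5) / resolution * (hi - lo)
                 for i, lo, hi in zip(index, box_min, box_max))


def effective_voxel_size(box_min: Vec3, box_max: Vec3, resolution: int = 64) -> Vec3:
    """World-space edge lengths of one voxel in the part's own grid."""
    return tuple((hi - lo) / resolution for lo, hi in zip(box_min, box_max))


# A small part occupying a 0.125-wide box inside a unit-sized object.
part_box = ((0.0, 0.0, 0.0), (0.125, 0.125, 0.125))
print(effective_voxel_size(*part_box))                      # per-part grid
print(effective_voxel_size((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)))  # shared global grid
```

With the same 64-voxel side length, the part's own grid resolves features eight times finer along each axis than a grid shared by the whole unit-sized object, which is the effect the per-part design is meant to exploit.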

UniPart (He et al., 10 Dec 2025) unifies geometry and segmentation in a single latent code (the Geom-Seg VecSet), enabling two-stage latent diffusion: initial generation yields a joint object geometry and part mask, and subsequent refinement operates on per-part latents in both global and normalized canonical spaces. The mesh decoder reconstructs each part as an implicit surface, positioned via transforms derived from dual-space correspondence.

2.3. Data Structure and Synchronization

The codimensional multimesh framework (Tao et al., 2 Jan 2025) is orthogonal, focusing on the hierarchy and consistency of embedded meshes of varying dimensionality. A rooted tree structure encodes containment maps between submeshes (e.g., UV seams within a surface mesh), and algorithmic extension/restriction operations synchronize edits throughout the hierarchy. The multimesh maintains mathematical invariants (face-purity, manifoldness) under local edits via link conditions and energy-constrained optimization.
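A minimal sketch of the rooted-tree idea follows. It is not the framework's data structure: the class `MeshNode`, the field `to_parent`, and all simplex ids are hypothetical, and the invariants (face-purity, manifoldness, link conditions) are not modeled. It only shows how containment maps can be composed to locate a child simplex in the root mesh.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class MeshNode:
    """One mesh in the hierarchy, e.g. a surface or a seam embedded in it."""
    name: str
    parent: Optional["MeshNode"] = None
    # Containment map: simplex id in this mesh -> id of the same simplex
    # in the parent mesh (ids here are purely illustrative).
    to_parent: Dict[int, int] = field(default_factory=dict)

    def lift_to_root(self, simplex_id: int) -> Tuple[str, int]:
        """Follow containment maps upward until the root mesh is reached."""
        node, sid = self, simplex_id
        while node.parent is not None:
            sid = node.to_parent[sid]
            node = node.parent
        return node.name, sid


surface = MeshNode("surface")
seam = MeshNode("uv_seam", parent=surface, to_parent={0: 17, 1: 42})

# Seam edge 1 is the same geometric edge as surface edge 42.
print(seam.lift_to_root(1))
```

An edit applied to a surface simplex could use the inverse of these maps to find the embedded simplices that must be updated with it, which is the synchronization role the hierarchy plays.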

3. Mathematical Formulations and Losses

3.1. Generative Decomposition

The essence of part-specific multi-mesh synthesis is a factorizable generative process:

$p_\theta(X \mid C_{pc}) = \prod_{i=1}^N p_\theta(X_i \mid X_{<i}, C_{pc})$

where $X_i$ is the collection of tokens representing part $i$, and $C_{pc}$ includes all global and part-specific conditions (Yang et al., 24 Nov 2025). Semi-autoregressive models sequentially condition on previously completed parts, while parallel compositional models propagate information via hierarchical attention.
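The factorization can be sketched in two steps: fix a part order by breadth-first search over the part-adjacency graph, then sum the per-part conditional log-probabilities in that order. The toy chair graph, the choice of root, and the callback `cond_log_prob` are assumptions made for illustration; a real model would supply the conditional term.

```python
import math
from collections import deque
from typing import Callable, Dict, List


def bfs_part_order(adjacency: Dict[str, List[str]], root: str) -> List[str]:
    """Order parts by breadth-first search from a chosen root part."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        part = queue.popleft()
        order.append(part)
        for neighbor in adjacency[part]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


def joint_log_prob(order: List[str],
                   cond_log_prob: Callable[[str, List[str]], float]) -> float:
    """Sum log p(X_i | X_<i, C) over parts; conditioning C is left implicit."""
    return sum(cond_log_prob(part, order[:i]) for i, part in enumerate(order))


chair = {
    "seat": ["back", "leg_left", "leg_right"],
    "back": ["seat"],
    "leg_left": ["seat"],
    "leg_right": ["seat"],
}
order = bfs_part_order(chair, root="seat")
print(order)
# Placeholder conditional: every part gets probability 0.5 regardless of context.
print(joint_log_prob(order, lambda part, previous: math.log(0.5)))
```

The product in the formula becomes a sum in log space, so the joint score is simply the accumulated conditional terms in generation order.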

3.2. Diffusion and Flow Matching

Both discrete (token-based) and latent (continuous) diffusion are employed. PartDiffuser (Yang et al., 24 Nov 2025) employs masked discrete diffusion:

Forward process: masking/unmasking transitions with respect to a fixed vocabulary.
Reverse process: a denoising model infers $p_\theta(x_{t-1} \mid x_t, X_{<i}, C_{dyn})$.
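The forward (masking) half of masked discrete diffusion can be sketched as follows. The mask token id, the uniform choice of positions, and the function name are assumptions for the example and are not taken from PartDiffuser; the reverse process would be a learned model that predicts the original tokens at the masked positions.

```python
import random
from typing import List

MASK = -1  # hypothetical id reserved for the mask token


def mask_tokens(tokens: List[int], mask_ratio: float, rng: random.Random) -> List[int]:
    """Replace a random subset of tokens with MASK.

    mask_ratio plays the role of the noise level: 0.0 leaves the sequence
    intact and 1.0 masks every position.
    """
    n_mask = round(mask_ratio * len(tokens))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked_positions else tok for i, tok in enumerate(tokens)]


rng = random.Random(0)
part_tokens = [5, 6, 7, 8]
print(mask_tokens(part_tokens, 0.5, rng))
```

During training, the denoiser would receive the partially masked part together with the earlier parts and the conditioning, and be supervised only on the masked positions.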

PartCrafter and CubePart (Lin et al., 5 Jun 2025, Zhu et al., 27 May 2026) use velocity-based flow matching, parameterizing a noisy latent $Z_t$ as a convex combination of the data latent $Z_0$ and random initialization $Z_1$,

$Z_t = t Z_0 + (1-t) Z_1,$

with the network predicting the velocity of this path, $\mathrm{d}Z_t/\mathrm{d}t = Z_0 - Z_1$.
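A small numeric sketch of this parameterization, using plain lists in place of latent tensors. The regression target is the time derivative of the interpolation as written in this section, $Z_0 - Z_1$; the mean-squared-error loss and all function names are assumptions for illustration, not the papers' training code.

```python
from typing import List


def interpolate(z0: List[float], z1: List[float], t: float) -> List[float]:
    """Z_t = t * Z_0 + (1 - t) * Z_1, with Z_0 the data latent and Z_1 the noise."""
    return [t * a + (1.0 - t) * b for a, b in zip(z0, z1)]


def velocity_target(z0: List[float], z1: List[float]) -> List[float]:
    """Derivative of Z_t with respect to t under the interpolation above."""
    return [a - b for a, b in zip(z0, z1)]


def flow_matching_loss(predicted_velocity: List[float],
                       z0: List[float], z1: List[float]) -> float:
    """Mean squared error between a predicted velocity and the target."""
    target = velocity_target(z0, z1)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)


z0, z1 = [1.0, 2.0], [0.5, 0.0]
print(interpolate(z0, z1, 0.25))
print(velocity_target(z0, z1))
print(flow_matching_loss([0.5, 2.0], z0, z1))
```

With this convention $t = 1$ recovers the data latent and $t = 0$ the noise, so sampling integrates the predicted velocity from $t = 0$ toward $t = 1$.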

3.3. Assembly and Part Consistency

Assembly strategies require explicit mitigation of part overlap/gaps. FullPart leverages NMS for bounding boxes, center-corner encoding to softly align boundaries, and mesh-VAE decoders trained for watertightness (Ding et al., 30 Oct 2025). Junction conditioning and “junction face” losses (as in MeshArt (Gao et al., 2024)) can further enforce continuity across parts.
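Non-maximum suppression over part bounding boxes can be sketched generically as below. This is the standard greedy procedure applied to axis-aligned 3D boxes, not FullPart's code; the box format `(min_corner, max_corner)`, the confidence scores, and the 0.5 threshold are assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[Sequence[float], Sequence[float]]  # (min corner, max corner)


def box_volume(box: Box) -> float:
    lo, hi = box
    return (hi[0] - lo[0]) * (hi[1] - lo[1]) * (hi[2] - lo[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        overlap = min(a[1][axis], b[1][axis]) - max(a[0][axis], b[0][axis])
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        intersection *= overlap
    return intersection / (box_volume(a) + box_volume(b) - intersection)


def nms_3d(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedily keep the highest-scoring boxes, dropping near-duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [((0, 0, 0), (1, 1, 1)), ((0.1, 0, 0), (1.1, 1, 1)), ((2, 2, 2), (3, 3, 3))]
print(nms_3d(boxes, scores=[0.9, 0.8, 0.7]))
```

The second box overlaps the first almost entirely and is suppressed, while the distant third box survives, which is the behavior needed to avoid generating the same part twice.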

4. Representational Strategies

Framework	Part Representation	Global–Part Coupling	Notable Features
PartDiffuser	Token blocks (DiT)	Autoregressive; cross-attn	Semi-AR, blockwise parallelism
PartCrafter	Compositional slot latents	Alternating local/global attn	Joint compositional diffusion
FullPart	Voxel grid per part	Center-corner encoding	Implicit box layout, max detail
UniPart	Unified geom-seg VecSet	2-stage diffusion	Dual-space decoding, no external seg
SDM-NET	Per-part VAE mesh codes	Structured Parts VAE	Joint ELBO, structure refinement
GetMesh	Latent point subsets	Latent manipulation	Arbitrary add/drop, cross-category
CubePart	Partwise SDF latents	Cross-part attn blocks	Open-vocab, schema-driven
MeshArt	Triangle VQ-VAE tokens	Structure-guided AR	Articulated, junction-conditioned
Codimensional MM	Hierarchical simplices	Containment maps, link cond.	Edits propagate through hierarchy

Representational choices impact editability, granularity, and downstream suitability. Explicit canonical grids (FullPart), triangle-based tokens (MeshArt), and latent point subsets (GetMesh) each have complementary strengths.

5. Dataset Constructions and Supervision

High-quality, large-scale, part-annotated datasets underpin recent state-of-the-art results. PartVerse-XL (Ding et al., 30 Oct 2025) provides 320K human-verified parts for 40K objects. CubePart (Zhu et al., 27 May 2026) builds upon an 11x larger asset pool (462K assets, 2M parts), using VLM-based (GPT-5) clustering for open-vocabulary part labeling, and cleaning through multi-view artifact detection and semantic consolidation. Many methods leverage GLTF metadata, manual expert curation (Blender merging/splitting), and dense point/normal sampling per part to support fine-grained multi-mesh supervision. A plausible implication is that advances in scalable, automated part discovery pipelines have enabled schema-driven and open-vocabulary multi-mesh synthesis at previously unattainable scale and diversity.

6. Empirical Assessment and Ablation

Quantitative evaluation is standardized around part-level Chamfer Distance (CD), F-score at a fixed distance threshold, part IoU, and runtime (seconds per asset). Recently reported results include:

Method	CD ↓ (Objaverse)	F1 ↑	Part IoU ↓	Runtime (s)
PartCrafter	0.1726	0.7472	0.0359	34
FullPart	0.11	0.81	0.36	—
CubePart	0.251 (part)	0.743	—	—
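For reference, the two point-based metrics in the table can be sketched with brute-force nearest-neighbor search. Conventions differ between papers (squared versus unsquared distances, summed versus averaged directions, threshold value), so the choices below are assumptions and the numbers they produce are not comparable to the table.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def nearest_dists(src: List[Point], dst: List[Point]) -> List[float]:
    """Distance from each point in src to its nearest neighbor in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer Distance: mean nearest distance in both directions."""
    d_ab, d_ba = nearest_dists(a, b), nearest_dists(b, a)
    return sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba)


def f_score(pred: List[Point], gt: List[Point], threshold: float) -> float:
    """Harmonic mean of precision and recall at a distance threshold."""
    precision = sum(d <= threshold for d in nearest_dists(pred, gt)) / len(pred)
    recall = sum(d <= threshold for d in nearest_dists(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(chamfer_distance(pred, gt))
print(f_score(pred, gt, threshold=0.5))
```

Part-level variants apply the same functions to points sampled from each generated part and its matched ground-truth part, then average over parts.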

PartDiffuser (Yang et al., 24 Nov 2025) achieves roughly 27% lower CD than MeshAnythingV2 and TreeMeshGPT. CubePart reports F-score both holistically (on the union of part outputs) and at part granularity (Zhu et al., 27 May 2026).

Ablation studies confirm:

Loss of global context or hierarchical conditioning increases CD substantially (e.g., “parts only” setting in PartDiffuser).
Omitting cross-part attention in CubePart degrades part-level completeness (CD rises from 0.251 to 0.433) and introduces floating or overlapping geometry.
Higher partwise parallelism accelerates inference (speedup up to 3.7$\times$ in PartDiffuser's blockwise diffusion), but may double CD.

7. Trends and Open Challenges

Recent advances have established:

Fully end-to-end open-vocabulary, user-driven schema control (CubePart), allowing specification of arbitrary part lists at inference.
State-of-the-art geometric completeness and fidelity at both global and per-part resolution, enabled by explicit partwise conditioning, compositional latents, and high-resolution per-part decoding (FullPart, PartCrafter).
Rich support for downstream uses such as animation, simulation, and behavior scripting via explicit, watertight part outputs.
Automated, scalable dataset creation using vision-language models to bridge semantic, geometric, and naming gaps across disparate sources.

Limitations remain:

Fine-grained part granularity is tied to noisy human or artist annotations in most pipelines, with limited current support for user-controlled granularity beyond schema enumeration (Zhu et al., 27 May 2026).
Bipartite packing solutions (e.g., Dual Volume Packing (Tang et al., 11 Jun 2025)) are bounded in connectivity and inflexible for higher-order adjacent-part relations.
Highly-connected graphs or bodies with >2 mutually-adjacent parts challenge two-volume methods and may require extensions (e.g., 3–4 colorings).
Explicit junction conditioning, support/symmetry constraints, and postprocessing are still necessary for seamless assembly in certain complex scenarios (Gao et al., 2024, Gao et al., 2019).
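The connectivity limitation of two-volume methods can be checked mechanically. The sketch assumes that adjacent parts must be placed in different volumes, so that a valid assignment exists exactly when the part-adjacency graph is 2-colorable (bipartite). The function name and the toy graphs are hypothetical, and this is not the packing algorithm of the cited work.

```python
from collections import deque
from typing import Dict, List, Optional


def two_color(adjacency: Dict[str, List[str]]) -> Optional[Dict[str, int]]:
    """Assign each part to volume 0 or 1 so that adjacent parts differ.

    Returns None when no such assignment exists (the graph has an odd cycle).
    """
    color: Dict[str, int] = {}
    for start in adjacency:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            part = queue.popleft()
            for neighbor in adjacency[part]:
                if neighbor not in color:
                    color[neighbor] = 1 - color[part]
                    queue.append(neighbor)
                elif color[neighbor] == color[part]:
                    return None  # two adjacent parts forced into one volume
    return color


chain = {"leg": ["seat"], "seat": ["leg", "back"], "back": ["seat"]}
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
print(two_color(chain))
print(two_color(triangle))
```

Three mutually adjacent parts already form an odd cycle, which is why such configurations would call for more than two volumes.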

Future research directions focus on planar graph coloring for richer packing (Tang et al., 11 Jun 2025), real-time editing, more data-efficient part representation, and tighter integration of controllable part-specific text/image guidance.
