PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers (2506.05573v1)

Published 5 Jun 2025 in cs.CV

Abstract: We introduce PartCrafter, the first structured 3D generative model that jointly synthesizes multiple semantically meaningful and geometrically distinct 3D meshes from a single RGB image. Unlike existing methods that either produce monolithic 3D shapes or follow two-stage pipelines, i.e., first segmenting an image and then reconstructing each segment, PartCrafter adopts a unified, compositional generation architecture that does not rely on pre-segmented inputs. Conditioned on a single image, it simultaneously denoises multiple 3D parts, enabling end-to-end part-aware generation of both individual objects and complex multi-object scenes. PartCrafter builds upon a pretrained 3D mesh diffusion transformer (DiT) trained on whole objects, inheriting the pretrained weights, encoder, and decoder, and introduces two key innovations: (1) A compositional latent space, where each 3D part is represented by a set of disentangled latent tokens; (2) A hierarchical attention mechanism that enables structured information flow both within individual parts and across all parts, ensuring global coherence while preserving part-level detail during generation. To support part-level supervision, we curate a new dataset by mining part-level annotations from large-scale 3D object datasets. Experiments show that PartCrafter outperforms existing approaches in generating decomposable 3D meshes, including parts that are not directly visible in input images, demonstrating the strength of part-aware generative priors for 3D understanding and synthesis. Code and training data will be released.

Summary

  • The paper introduces a novel end-to-end method that synthesizes semantically meaningful 3D parts from a single RGB image using a compositional latent space.
  • The paper employs a modified 3D mesh diffusion transformer with hierarchical local and global attention to capture both intra-part details and inter-part relationships.
  • The paper demonstrates significant improvements in reconstruction fidelity, geometry independence, and generation speed compared to traditional segmentation-based methods.

PartCrafter is a novel approach for structured 3D mesh generation, enabling the synthesis of 3D objects and scenes as collections of distinct, semantically meaningful parts directly from a single RGB image. This contrasts with traditional methods that either generate monolithic 3D shapes or rely on multi-stage pipelines involving explicit 2D or 3D segmentation prior to reconstruction.

The core problem PartCrafter addresses is the lack of inherent part-based structure in many existing 3D generative models. Generating whole objects without part decomposition limits their utility in downstream applications like editing, animation, or physical simulation. Two-stage methods attempt to add structure by first segmenting the input image or a coarse 3D reconstruction and then generating/reconstructing each segment. However, these pipelines are prone to errors from the segmentation step, are computationally expensive, struggle with scalability, and often fail to reconstruct parts not visible in the input image.

PartCrafter overcomes these limitations by proposing an end-to-end, single-stage generative model. It takes a single RGB image as input and jointly denoises multiple sets of latent variables, where each set is designated to represent a specific 3D part. This is achieved by building upon a pretrained 3D mesh Diffusion Transformer (DiT), specifically leveraging components from TripoSG (2502.06608), a state-of-the-art image-to-mesh model trained on monolithic objects.
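To make "jointly denoises multiple sets of latent variables" concrete, below is a minimal, illustrative sketch of what inference over a compositional latent set could look like: all part token sets are integrated together under one image-conditioned velocity field, so inter-part structure is resolved during generation rather than by post-hoc segmentation. The `velocity_model` name, its signature, and the Euler integration schedule are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch of joint, single-stage denoising of N part latent sets.
import torch

def sample_parts(velocity_model, image_features, num_parts, tokens_per_part=512,
                 latent_dim=64, num_steps=50, device="cpu"):
    """Euler integration of an assumed rectified-flow velocity field over a
    compositional latent set; `velocity_model` is a hypothetical module."""
    # One latent token set per part, denoised jointly: (N_parts, tokens, dim).
    z = torch.randn(num_parts, tokens_per_part, latent_dim, device=device)
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = timesteps[i], timesteps[i + 1]
        # The model sees all parts at once plus the image condition,
        # so parts can coordinate at every denoising step.
        v = velocity_model(z, t.expand(num_parts), image_features)
        z = z + (t_next - t) * v  # Euler step from noise toward data
    return z  # each z[i] is decoded into the mesh of part i
```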

The key innovations introduced in PartCrafter are:

  1. Compositional Latent Space: Instead of a single set of latents for a whole object, PartCrafter uses $N$ sets of latent tokens, $\{\boldsymbol{z}_i\}_{i=1}^N$, where each $\boldsymbol{z}_i$ represents the $i$-th part. This disentanglement allows for independent manipulation of parts. A learnable part identity embedding is added to each token set to help the model distinguish between parts.
  2. Hierarchical Attention Mechanism: The model uses a modified DiT architecture that incorporates both local and global attention. Local attention is applied independently within each part's latent token set ($\boldsymbol{z}_i$), focusing on intra-part details. Global attention operates on the concatenated set of all part tokens ($\boldsymbol{\mathcal{Z}}$), capturing inter-part relationships and global scene coherence. Image conditioning (using DINOv2 features (2304.07193)) is injected via cross-attention at both the local and global levels. The paper finds that alternating local and global attention in the DiT blocks works best (a minimal sketch follows this list).
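
The following is a minimal PyTorch sketch of the block structure described above: part identity embeddings, within-part local attention, across-part global attention, and image cross-attention. Module names and dimensions are illustrative assumptions, and for compactness this sketch stacks local and global attention in one block, whereas the paper alternates them across DiT blocks.

```python
# Hedged sketch of the local/global attention idea; not the released implementation.
import torch
import torch.nn as nn

class HierarchicalAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8, num_parts=16):
        super().__init__()
        self.part_embed = nn.Embedding(num_parts, dim)  # learnable part identity embedding
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z, image_tokens):
        # z: (N_parts, T_tokens, dim); image_tokens: (1, T_img, dim) DINOv2-style features.
        n, t, d = z.shape
        z = z + self.part_embed(torch.arange(n, device=z.device)).unsqueeze(1)

        # Local attention: each part attends only to its own tokens (intra-part detail).
        local, _ = self.local_attn(z, z, z)
        z = z + local

        # Global attention: all part tokens are concatenated so parts stay coherent.
        flat = z.reshape(1, n * t, d)
        global_out, _ = self.global_attn(flat, flat, flat)
        z = z + global_out.reshape(n, t, d)

        # Image conditioning injected via cross-attention to the conditioning tokens.
        img = image_tokens.expand(n, -1, -1)
        cond, _ = self.cross_attn(z, img, img)
        return z + cond
```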

To train PartCrafter, the authors curate a new dataset by mining part-level annotations from large existing 3D datasets like Objaverse (2303.10853), ShapeNet (1512.03012), and ABO (2202.05823). This dataset contains around 50,000 part-labeled objects and 300,000 individual parts, augmented with 30,000 monolithic objects for regularization. For 3D scene generation, they utilize the 3D-Front dataset (2102.05072).

The model is trained with a rectified flow matching objective (2209.15571, 2210.02747, 2209.03003, 2403.12015) on the compositional latent space. It is initialized from pretrained TripoSG weights and finetuned with a curriculum strategy: a base model is first trained on objects with up to 8 parts, then adapted to more parts (up to 16 for objects) or to scene generation (up to 8 objects per scene) on 3D-Front. Training takes about 2 days on 8 H20 GPUs, with each part represented by 512 latent tokens.
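
For reference, a single rectified flow matching training step on the compositional latent space could look like the sketch below. The `denoiser` module, its signature, and the use of one shared timestep per object are assumptions for illustration; only the linear interpolation path and the velocity regression target follow the standard rectified flow formulation.

```python
# Hedged sketch of a rectified flow matching loss over part latents.
import torch
import torch.nn.functional as F

def flow_matching_loss(denoiser, z0, image_features):
    """z0: clean part latents (N_parts, T_tokens, dim) from the pretrained VAE encoder."""
    noise = torch.randn_like(z0)
    t = torch.rand(1, device=z0.device)                 # shared timestep for all parts
    # Rectified flow interpolates linearly between data (t=0) and noise (t=1).
    z_t = (1.0 - t) * z0 + t * noise
    target_velocity = noise - z0                        # constant along the linear path
    pred_velocity = denoiser(z_t, t.expand(z0.shape[0]), image_features)
    return F.mse_loss(pred_velocity, target_velocity)
```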

Experiments demonstrate PartCrafter's effectiveness on both part-level 3D object generation and 3D scene reconstruction from a single image. It is compared against segmentation-based baselines adapted for the task: HoloPart (2504.07943) for objects and MIDI (2412.03558) for scenes. PartCrafter significantly outperforms these baselines in reconstruction fidelity (Chamfer Distance, F-Score) on Objaverse, ShapeNet, ABO, and 3D-Front, achieves better or comparable geometric independence between parts (measured by IoU), and generates results substantially faster (seconds per object or scene versus minutes). Notably, PartCrafter can infer and generate parts that are not visible in the input image, a common limitation of segmentation-based methods.
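
As a reading aid for the reported metrics, the helper below computes Chamfer Distance and F-Score between point clouds sampled from a predicted and a ground-truth mesh. The sampling step and the distance threshold are assumptions, not the paper's exact evaluation protocol.

```python
# Illustrative Chamfer Distance / F-Score computation between two point sets.
import torch

def chamfer_and_fscore(pred_pts, gt_pts, threshold=0.1):
    """pred_pts: (P, 3), gt_pts: (Q, 3) points sampled from the two meshes."""
    dists = torch.cdist(pred_pts, gt_pts)          # (P, Q) pairwise Euclidean distances
    d_pred_to_gt = dists.min(dim=1).values         # nearest GT point for each prediction
    d_gt_to_pred = dists.min(dim=0).values         # nearest prediction for each GT point
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (d_pred_to_gt < threshold).float().mean()
    recall = (d_gt_to_pred < threshold).float().mean()
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer.item(), fscore.item()
```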

Ablation studies confirm the critical role of both local and global attention, the part identity embeddings for distinguishing parts, and the specific alternating order of local-global attention within the DiT architecture. The model is also shown to generate reasonable part decompositions for a varied number of parts from the same image.

Practical implementation considerations include the need for a pretrained 3D diffusion model backbone and a dataset with part-level annotations. The compositional latent space and hierarchical attention allow the method to scale to multi-part objects and scenes, providing a foundation for downstream applications that require structured 3D data. While current training relies on a part-annotated dataset that is small relative to monolithic object datasets, the paper suggests that scaling up the DiT training data could further improve performance. The environmental cost of training large diffusion models is acknowledged.

In summary, PartCrafter provides a practical, end-to-end solution for structured 3D generation from single images by integrating part-level understanding into a diffusion transformer through a novel compositional latent space and hierarchical attention mechanism, demonstrating superior performance over multi-stage segmentation-based approaches.
