MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation (2412.03558v2)

Published 4 Dec 2024 in cs.CV

Abstract: This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism, which effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.

Summary

  • The paper proposes a multi-instance diffusion paradigm that simultaneously generates multiple 3D objects while capturing their spatial relationships.
  • It introduces a novel multi-instance attention mechanism that allows tokens from different objects to interact, ensuring global scene coherence.
  • Experiments on synthetic and real-world datasets demonstrate state-of-the-art performance in object placement accuracy and geometric quality.

MIDI (2412.03558) is a research paper that introduces a novel method for generating compositional 3D scenes from a single 2D image. The core challenge is to infer the 3D geometry of individual objects and their spatial relationships from limited visual information.

Existing methods typically fall into three categories: feed-forward reconstruction, retrieval-based methods, and multi-stage compositional generation. Feed-forward methods train end-to-end networks on 3D datasets but struggle to generalize because of data scarcity. Retrieval-based methods match and assemble 3D models from a database, but are limited by database diversity and by the accuracy of retrieval from a single image. Multi-stage compositional methods segment the image, complete object views, generate objects independently with powerful single-object models, and then optimize their layout. While these methods leverage strong object priors, they suffer from error accumulation across stages and lack global scene context during object generation, which can lead to misalignments.

MIDI proposes a multi-instance diffusion model paradigm that extends pre-trained image-to-3D object generation models to directly generate multiple 3D instances simultaneously while capturing their spatial relationships. This approach aims to overcome the limitations of multi-stage pipelines by integrating object generation and spatial arrangement into a single, coherent process.

The MIDI framework builds upon existing 3D object generation models, which typically consist of a VAE that compresses 3D data into a latent space and a denoising transformer (DiT) that operates in that latent space. MIDI modifies this structure in three key ways for multi-instance generation:

  1. Simultaneous Denoising: The latent representations of multiple 3D instances derived from the input image are denoised in parallel using a shared network.
  2. Multi-Instance Attention: A novel attention mechanism is introduced within the DiT. Unlike standard self-attention, where tokens from an object interact only with other tokens from the same object, multi-instance attention lets tokens from any given instance query tokens from all instances in the scene. This enables the model to learn and enforce cross-instance interactions and spatial coherence directly during the diffusion process (see the sketch after this list).
  3. Image Conditioning: The model is conditioned on a composite image input that includes the global scene image, individual object images, and their masks. A ViT-based encoder processes this input, and the resulting features are integrated into the denoising network via cross-attention.
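
A minimal sketch of what such a multi-instance attention layer could look like in PyTorch, assuming per-instance latents are kept in a tensor of shape (batch, num_instances, tokens, dim); the module and its tensor layout are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiInstanceAttention(nn.Module):
    """Self-attention in which the tokens of every instance attend to the
    tokens of all instances in the scene (a sketch, not the paper's code)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_instances, tokens_per_instance, dim)
        b, n, t, d = x.shape
        # Flatten the instances into one long token sequence so that
        # attention crosses instance boundaries.
        x_flat = x.reshape(b, n * t, d)
        q, k, v = (
            y.reshape(b, n * t, self.num_heads, -1).transpose(1, 2)
            for y in self.qkv(x_flat).chunk(3, dim=-1)
        )
        out = F.scaled_dot_product_attention(q, k, v)   # (b, heads, n*t, head_dim)
        out = out.transpose(1, 2).reshape(b, n * t, d)
        out = self.proj(out)
        # Restore the per-instance layout expected by the rest of the DiT block.
        return out.reshape(b, n, t, d)
```

With a single instance (n = 1), the same layer reduces to ordinary per-object self-attention, which is how the single-object regularization steps described below can reuse it unchanged.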

Training MIDI involves fine-tuning the denoising network and the image encoder. The loss function is based on the rectified flow objective, extended so that the predicted velocity is regressed against the target velocity for each instance's latent representation. To maintain the strong generalization of the pre-trained single-object models, MIDI employs a mixed training strategy: with a certain probability, a training step uses only a single object from a large-scale object dataset (such as Objaverse), which effectively turns off the multi-instance attention for that step and regularizes the object generation prior. LoRA is used for efficient fine-tuning.
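
A sketch of what one such mixed training step could look like under a rectified-flow (velocity-matching) objective; `model`, `scene_loader`, `object_loader`, and the single-object probability `p_single` are illustrative placeholders rather than names from the paper:

```python
import random
import torch
import torch.nn.functional as F

def training_step(model, scene_loader, object_loader, p_single=0.3):
    """One hypothetical training step mixing scene-level and single-object data."""
    if random.random() < p_single:
        # Single-object regularization step drawn from a large object dataset
        # (e.g. Objaverse); with one instance, multi-instance attention is inactive.
        latents, cond = next(object_loader)     # latents: (b, 1, tokens, dim)
    else:
        # Scene-level step: all instances of each scene are denoised jointly.
        latents, cond = next(scene_loader)      # latents: (b, n, tokens, dim)

    # Rectified flow: interpolate linearly between noise and data, then regress
    # the constant velocity (data - noise) for every instance latent.
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_) * noise + t_ * latents
    target_velocity = latents - noise

    pred_velocity = model(x_t, t, cond)         # shared DiT over all instances
    return F.mse_loss(pred_velocity, target_velocity)
```

Since fine-tuning relies on LoRA, a step like this would in practice update only low-rank adapter weights rather than the full DiT.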

For inference, MIDI uses Grounded-SAM for initial object segmentation from the input image. The segmented object images, masks, and the global scene image are fed into the trained multi-instance diffusion model. The diffusion process generates the latent representations for all instances simultaneously, which are then decoded into 3D geometry. Classifier-free guidance is used to improve generation quality.
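
The inference flow described above could be wired together roughly as follows; `segmenter`, `midi_model`, and `vae_decoder` stand in for Grounded-SAM, the multi-instance diffusion model, and the VAE decoder, and their interfaces here are assumptions made for the sketch:

```python
import torch

@torch.no_grad()
def image_to_scene(scene_image, segmenter, midi_model, vae_decoder,
                   num_steps=50, guidance_scale=7.0):
    """Sketch of MIDI-style inference: segment, jointly denoise all instances, decode."""
    # 1. Instance segmentation of the input image (Grounded-SAM in the paper);
    #    `segmenter` is a placeholder returning per-object crops and masks.
    object_images, masks = segmenter(scene_image)
    n = len(object_images)

    # 2. Encode the conditioning once: global scene image + object crops + masks.
    cond = midi_model.encode_condition(scene_image, object_images, masks)

    # 3. Start every instance latent from Gaussian noise and denoise them jointly.
    latents = torch.randn(1, n, midi_model.num_tokens, midi_model.latent_dim)
    dt = 1.0 / num_steps
    for t in torch.linspace(0.0, 1.0, num_steps + 1)[:-1]:
        t_batch = t.unsqueeze(0)
        v_cond = midi_model(latents, t_batch, cond)
        v_uncond = midi_model(latents, t_batch, None)
        # Classifier-free guidance on the predicted velocity field.
        v = v_uncond + guidance_scale * (v_cond - v_uncond)
        latents = latents + v * dt   # Euler step from noise (t=0) toward data (t=1)

    # 4. Decode each instance latent into 3D geometry with the VAE decoder.
    return [vae_decoder(latents[:, i]) for i in range(n)]
```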

Experiments on synthetic datasets (3D-Front, BlendSwap) and real-world datasets (Matterport3D, ScanNet) demonstrate that MIDI achieves state-of-the-art performance. Quantitatively, MIDI significantly outperforms existing methods across metrics, particularly for object placement accuracy (Volume IoU of bounding boxes) and object-level geometry, while maintaining competitive scene-level metrics and runtime efficiency. Qualitatively, MIDI generates scenes with more accurate object geometries and better spatial coherence compared to baselines, especially in challenging cases with overlapping objects or limited views. Its generalization ability is further validated by generating plausible 3D scenes from stylized images created by text-to-image models.
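
For reference, the bounding-box Volume IoU used to measure placement accuracy is simple to compute for axis-aligned boxes; the paper's exact evaluation protocol may differ from this minimal sketch:

```python
import numpy as np

def volume_iou(box_a, box_b):
    """Volume IoU of two axis-aligned 3D boxes, each given as (min_xyz, max_xyz)."""
    min_a, max_a = (np.asarray(c, dtype=float) for c in box_a)
    min_b, max_b = (np.asarray(c, dtype=float) for c in box_b)

    # Overlap extent along each axis (zero where the boxes do not intersect).
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0.0, None)
    inter = overlap.prod()

    vol_a = (max_a - min_a).prod()
    vol_b = (max_b - min_b).prod()
    union = vol_a + vol_b - inter
    return float(inter / union) if union > 0 else 0.0

# Two unit cubes offset by half a unit along x share a third of their union.
print(volume_iou(((0, 0, 0), (1, 1, 1)), ((0.5, 0, 0), (1.5, 1, 1))))  # ~0.333
```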

Ablation studies highlight the importance of the proposed components:

  • The multi-instance attention is crucial for capturing spatial relationships; models without it produce incoherent layouts.
  • Using the global scene image as conditioning is essential for proper object placement and scene coherence.
  • Mixed training with single-object data is vital for preserving object-level geometric quality and preventing overfitting to the smaller scene dataset.

While MIDI marks a significant step, it has limitations: small objects may be generated at lower resolution than when generated in isolation, and complex object interactions remain hard to model given current dataset limitations. Future work could address these issues by generating objects in canonical spaces while predicting their scene transforms, and by training on richer datasets with more complex interactions.

In summary, MIDI provides a practical and effective method for single-image to 3D scene generation by reframing it as a multi-instance diffusion process with a novel attention mechanism, enabling the simultaneous generation of spatially coherent 3D objects from a single image.