HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model (2406.20077v1)

Published 28 Jun 2024 in cs.CV

Abstract: We introduce HouseCrafter, a novel approach that can lift a floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along sampled locations based on the floorplan, where previously generated images are used as conditions to the diffusion model to produce images at nearby locations. The global floorplan and the attention design in the diffusion model ensure the consistency of the generated images, from which a 3D scene can be reconstructed. Through extensive evaluation on the 3D-Front dataset, we demonstrate that HouseCrafter can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices. We will release our code and model weights. Project page: https://neu-vi.github.io/houseCrafter/

Authors (5)

Hieu T. Nguyen
Yiwen Chen
Vikram Voleti
Varun Jampani
Huaizu Jiang

Summary

The paper introduces a novel method using autoregressive RGB-D image generation to synthesize coherent 3D scenes from 2D floorplans.
It integrates a layout-attention mechanism to infuse geometric and semantic details from floorplans, ensuring global scene consistency.
Depth-enhanced view synthesis, using a Depth-Enhanced Camera Positional Encoding (DeCaPE), improves 3D reconstruction quality; the method outperforms baselines on quantitative metrics and in a user study.

HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

The paper "HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models" presents a method for generating large-scale 3D indoor scenes from 2D floorplans by adapting pre-trained 2D diffusion models. These models, originally trained on large collections of 2D images, are used to synthesize RGB and depth images at multiple viewpoints across the scene. The generated images are then fused to reconstruct a consistent, detailed 3D scene. The authors report that HouseCrafter copes with the complexity of house-scale environments and produces detailed, coherent 3D scenes that remain faithful to the given floorplans.

Key Contributions

Autoregressive Generation of RGB-D Images: HouseCrafter adapts a 2D diffusion model to autoregressively generate multi-view RGB-D images. This generation is done in a batch-wise manner using previously generated images as conditions, ensuring inter-view consistency. The method uses a novel-view synthesis pipeline allowing efficient and semantically consistent generation.
Integration of 2D Floorplan Guidance: The model introduces a layout-attention mechanism to incorporate floorplan information at different scales into the diffusion process, improving the global consistency of the generated large-scale scenes. The injection of geometric and semantic details from the floorplan ensures adherence to the specified configuration.
Depth-Enhanced View Synthesis: HouseCrafter includes depth information in both input and output stages, decoupling geometry and appearance. This enhancement facilitates a more accurate 3D scene reconstruction, addressing the limitations of prior methods that suffer from scale ambiguity and depth inconsistencies.
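The batch-wise autoregressive loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: poses are hypothetical (x, y, yaw) tuples on the floorplan, reference views are chosen by a simple nearest-planar-distance rule, and generate_batch is a hypothetical stand-in for the diffusion model (here a stub that returns labels). In this sketch the first batch has no generated references and would rely on the floorplan alone.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical pose format: (x, y, yaw) on the floorplan, meters and radians.
Pose = Tuple[float, float, float]


def nearest_references(target: Pose, generated: Dict[int, Pose], k: int) -> List[int]:
    """Ids of up to k already-generated views closest to the target in (x, y).

    Planar-distance selection is an illustrative assumption, not the paper's rule.
    """
    def dist(view_id: int) -> float:
        p = generated[view_id]
        return math.hypot(p[0] - target[0], p[1] - target[1])

    return sorted(generated, key=dist)[:k]


def autoregressive_generate(
    poses: List[Pose],
    batch_size: int,
    k_refs: int,
    generate_batch: Callable[[List[Pose], List[int]], List[str]],
) -> Dict[int, str]:
    """Generate views batch by batch, conditioning each batch on earlier outputs."""
    generated_poses: Dict[int, Pose] = {}
    outputs: Dict[int, str] = {}
    for start in range(0, len(poses), batch_size):
        batch_ids = list(range(start, min(start + batch_size, len(poses))))
        batch_poses = [poses[i] for i in batch_ids]
        # Pool the nearest earlier views of every target in this batch.
        refs = set()
        for p in batch_poses:
            refs.update(nearest_references(p, generated_poses, k_refs))
        images = generate_batch(batch_poses, sorted(refs))
        # Newly generated views become candidate references for later batches.
        for i, img in zip(batch_ids, images):
            generated_poses[i] = poses[i]
            outputs[i] = img
    return outputs


# Usage with a stub in place of the diffusion model (records which refs it saw).
calls: List[List[int]] = []


def fake_generate(batch_poses: List[Pose], refs: List[int]) -> List[str]:
    calls.append(list(refs))
    return [f"rgbd@({x:.1f},{y:.1f})" for x, y, _ in batch_poses]


poses = [(float(i), 0.0, 0.0) for i in range(6)]
out = autoregressive_generate(poses, batch_size=2, k_refs=1, generate_batch=fake_generate)
print(calls)  # [[], [1], [3]]
```

Conditioning each batch on its spatial neighbours is what lets consistency propagate along the sampled path; the real pose sampling and reference selection in the paper may differ.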

Methodology

Novel View RGB-D Image Generation

The core of HouseCrafter is its novel view synthesis model, which extends the pre-trained UNet of Stable Diffusion v1.5 to handle RGB-D data. The model processes multiple views simultaneously, ensuring cross-view consistency. The floorplan is integrated at several layers of the UNet through a layout-attention mechanism, which allows the input latent features to be modulated by the encoded layout information independently for each ray going through the image.
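The per-ray idea behind layout attention can be illustrated with a toy, single-ray version. This is a hedged sketch, not the paper's layer: it assumes the floorplan is rasterized into a grid of per-cell feature vectors, samples points along the pixel's ray at hypothetical depths, looks up the floorplan feature under each sample, and lets the pixel's latent feature attend over those samples. Learned query/key/value projections, multiple heads, and multi-scale floorplan encodings are omitted; all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sample_ray_on_floorplan(origin, direction, depths) -> List[Tuple[float, float]]:
    """Points origin + t * direction, projected onto the floorplan (x, y) plane."""
    return [(origin[0] + t * direction[0], origin[1] + t * direction[1]) for t in depths]


def layout_feature(grid: List[List[Vec]], cell_size: float, xy) -> Vec:
    """Nearest-cell lookup of a floorplan embedding; zeros outside the grid."""
    i, j = int(xy[1] // cell_size), int(xy[0] // cell_size)
    if 0 <= i < len(grid) and 0 <= j < len(grid[0]):
        return grid[i][j]
    return [0.0] * len(grid[0][0])


def layout_attention(query: Vec, origin, direction, depths, grid, cell_size) -> Vec:
    """One ray: attend from a pixel's latent feature to floorplan features on its ray."""
    feats = [layout_feature(grid, cell_size, xy)
             for xy in sample_ray_on_floorplan(origin, direction, depths)]
    scale = 1.0 / math.sqrt(len(query))  # scaled dot-product attention
    weights = softmax([dot(query, f) * scale for f in feats])
    attended = [sum(w * f[c] for w, f in zip(weights, feats)) for c in range(len(query))]
    return [q + a for q, a in zip(query, attended)]  # residual update of the latent


# Usage: a 1x2 grid with 1 m cells; cell (0, 0) is "wall", cell (0, 1) is "floor".
grid = [[[1.0, 0.0], [0.0, 1.0]]]
out = layout_attention(
    query=[1.0, 0.0],
    origin=(0.5, 0.5, 1.5),
    direction=(1.0, 0.0, 0.0),
    depths=[0.0, 1.0],
    grid=grid,
    cell_size=1.0,
)
```

Because the samples follow the pixel's own ray, each pixel only sees the part of the floorplan it actually looks at, which is the "independently for each ray" property described above.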

Depth-Enhanced Camera Positional Encoding (DeCaPE)

To leverage depth information from reference views, the model employs DeCaPE, an augmented positional encoding that incorporates 3D positions of reference image features. This encoding improves the cross-attention mechanism between target and reference features, enhancing the geometric consistency across views.
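A rough sketch of what "incorporating 3D positions of reference image features" involves: a reference pixel with known depth is lifted to a world-space point through the pinhole camera model, and that point is then turned into a positional encoding. The paper's exact DeCaPE parameterization is not given here, so this uses a generic NeRF-style sinusoidal encoding as a stand-in; the function names, intrinsics layout (fx, fy, cx, cy), and camera-to-world convention are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


def unproject(u: float, v: float, depth: float,
              intrinsics: Tuple[float, float, float, float],
              rotation: Mat3, translation: Vec3) -> Tuple[float, ...]:
    """Lift pixel (u, v) with z-depth to a world-space point (pinhole model)."""
    fx, fy, cx, cy = intrinsics
    p_cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Camera-to-world transform: p_world = R @ p_cam + t.
    return tuple(
        sum(rotation[r][c] * p_cam[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def fourier_encode(point: Sequence[float], num_freqs: int) -> List[float]:
    """Sinusoidal encoding: sin/cos at frequencies 2^k * pi for each axis."""
    out: List[float] = []
    for coord in point:
        for k in range(num_freqs):
            w = (2.0 ** k) * math.pi
            out.extend((math.sin(w * coord), math.cos(w * coord)))
    return out


def depth_enhanced_pe(u, v, depth, intrinsics, rotation, translation, num_freqs=4):
    """Hypothetical DeCaPE-like encoding of a reference-view feature at pixel (u, v)."""
    point = unproject(u, v, depth, intrinsics, rotation, translation)
    return fourier_encode(point, num_freqs)


identity: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
K = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy
print(unproject(150.0, 50.0, 2.0, K, identity, (1.0, 2.0, 3.0)))  # (3.0, 2.0, 5.0)
pe = depth_enhanced_pe(50.0, 50.0, 1.0, K, identity, (0.0, 0.0, 0.0))
print(len(pe))  # 24
```

Compared with an encoding of the camera ray alone, adding depth pins each reference feature to a single 3D location, which is what lets cross-attention match target and reference features by geometry rather than by ray direction only.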

Results

The method has been evaluated on the 3D-Front dataset, showing that it can generate high-quality 3D scenes from floorplans. Quantitative metrics for image quality (FID, IS) and depth (AbsRel, $\delta_i$) show that HouseCrafter outperforms baseline methods such as CC3D and Text2Room. The ablation studies underline the importance of depth conditioning and floorplan guidance, showing significant improvements in consistency and visual fidelity when these components are included.
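For reference, the two depth metrics have standard definitions in the depth-estimation literature: AbsRel is the mean of |d_pred - d_gt| / d_gt over valid pixels, and $\delta_i$ is the fraction of pixels whose ratio max(d_pred / d_gt, d_gt / d_pred) falls below 1.25^i. The sketch below implements those textbook definitions on flat lists; the paper's exact evaluation protocol (which views are compared, how invalid pixels are masked) is not described here and may differ.

```python
from typing import List, Sequence, Tuple


def _valid_pairs(pred: Sequence[float], gt: Sequence[float]) -> List[Tuple[float, float]]:
    """Keep pixels where both depths are positive (an assumed validity mask)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    if not pairs:
        raise ValueError("no valid depth pixels")
    return pairs


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred: Sequence[float], gt: Sequence[float], i: int,
                   base: float = 1.25) -> float:
    """Fraction of pixels with max(pred/gt, gt/pred) < base ** i."""
    pairs = _valid_pairs(pred, gt)
    thresh = base ** i
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < thresh)
    return hits / len(pairs)


gt = [1.0, 2.0, 4.0, 5.0]
pred = [1.0, 2.2, 3.0, 10.0]
print(round(abs_rel(pred, gt), 4))  # 0.3375
print([delta_accuracy(pred, gt, i) for i in (1, 2, 3)])  # [0.5, 0.75, 0.75]
```

Lower AbsRel is better, while higher $\delta_i$ is better; larger i gives a more lenient threshold, so $\delta_1 \le \delta_2 \le \delta_3$ always holds.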

User Study and Layout Compliance

An extensive user study further corroborates the quantitative results, indicating a strong preference for HouseCrafter's outputs in terms of visual appeal and alignment with the given floorplans. Additionally, layout-compliance metrics computed with ODIN confirm that HouseCrafter's generated scenes adhere more closely to the input floorplan configuration, with mAP scores significantly higher than those of the baselines.

Implications and Future Directions

The research presented in this paper holds substantial practical and theoretical implications. On a practical level, it offers a scalable and efficient tool for generating detailed 3D indoor scenes, which can significantly reduce manual effort in industries like architecture, interior design, and real estate visualization. Theoretically, this work demonstrates the potential of combining 2D generative models with floorplan guidance to overcome the challenges associated with scarce 3D data.

Future research could explore:

Enhanced 3D Reconstruction Techniques: Developing reconstruction methods that can model view-dependent colors to improve the realism of the textured meshes.
Optimized Pose Sampling: Designing more efficient pose sampling strategies that balance between consistency and computational efficiency.
Instance-aware Generation: Integrating instance-level information to further improve fidelity to the input floorplans.

Overall, HouseCrafter is a notable advancement towards automated, scalable, and high-fidelity 3D scene generation from 2D layouts, pushing the boundaries of current techniques and opening new avenues for practical applications and research enhancements.
