- The paper introduces a novel method using autoregressive RGB-D image generation to synthesize coherent 3D scenes from 2D floorplans.
- It integrates a layout-attention mechanism to infuse geometric and semantic details from floorplans, ensuring global scene consistency.
- Depth-enhanced view synthesis with DeCaPE improves 3D reconstruction quality, achieving superior quantitative metrics and user study results.
HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models
The paper "HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models" presents a method for generating large-scale 3D indoor scenes from 2D floorplans using 2D diffusion models. The system adapts a pre-trained 2D diffusion model, originally trained on large amounts of 2D image data, to synthesize RGB and depth images at multiple viewpoints. The generated images are then fused to reconstruct a consistent, detailed 3D scene. The reported results indicate that HouseCrafter scales to house-sized environments and produces detailed, coherent 3D scenes that follow the given floorplan.
Key Contributions
- Autoregressive Generation of RGB-D Images: HouseCrafter adapts a 2D diffusion model to autoregressively generate multi-view RGB-D images. This generation is done in a batch-wise manner using previously generated images as conditions, ensuring inter-view consistency. The method uses a novel-view synthesis pipeline allowing efficient and semantically consistent generation.
- Integration of 2D Floorplan Guidance: The model introduces a layout-attention mechanism to incorporate floorplan information at different scales into the diffusion process, improving the global consistency of the generated large-scale scenes. The injection of geometric and semantic details from the floorplan ensures adherence to the specified configuration.
- Depth-Enhanced View Synthesis: HouseCrafter includes depth information in both input and output stages, decoupling geometry and appearance. This enhancement facilitates a more accurate 3D scene reconstruction, addressing the limitations of prior methods that suffer from scale ambiguity and depth inconsistencies.
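The batch-wise autoregressive scheme in the first contribution can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the helper names (`nearest_references`, `generate_scene_views`, `fake_generate_batch`) are hypothetical, the diffusion model is replaced by a stub, and picking the spatially closest earlier views as conditions is an assumption made for the example.

```python
import numpy as np

def nearest_references(target_positions, done_positions, k):
    """Indices of the k already-generated views closest to the target batch."""
    if len(done_positions) == 0:
        return []
    done = np.asarray(done_positions, dtype=float)
    targets = np.asarray(target_positions, dtype=float)
    # Distance from each generated view to its closest target in the batch.
    diffs = done[:, None, :] - targets[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1).min(axis=1)
    return np.argsort(dists)[:k].tolist()

def generate_scene_views(positions, batch_size, num_refs, generate_batch):
    """Generate views batch by batch, conditioning on earlier outputs."""
    views = []           # generated views, in generation order
    done_positions = []  # camera positions of the views generated so far
    for start in range(0, len(positions), batch_size):
        batch = positions[start:start + batch_size]
        ref_ids = nearest_references(batch, done_positions, num_refs)
        refs = [views[i] for i in ref_ids]
        # Hypothetical model call: returns one RGB-D view per target pose.
        views.extend(generate_batch(batch, refs))
        done_positions.extend(batch)
    return views

def fake_generate_batch(batch, refs):
    """Stand-in for the diffusion model; records how many references it saw."""
    return [{"position": p, "num_refs": len(refs)} for p in batch]

positions = [(float(i), 0.0) for i in range(5)]
views = generate_scene_views(positions, batch_size=2, num_refs=2,
                             generate_batch=fake_generate_batch)
```

The first batch has no references and is generated from the floorplan alone; every later batch is conditioned on views produced earlier, which is what ties the views together across the scene.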
Methodology
Novel View RGB-D Image Generation
The core of HouseCrafter is its novel view synthesis model, which extends the pre-trained UNet of Stable Diffusion v1.5 to handle RGB-D data. The model processes multiple views simultaneously to keep them consistent with one another. The floorplan is injected at several layers of the UNet through a layout-attention mechanism, in which the latent features are modulated by the encoded layout information independently for each ray passing through the image.
Depth-Enhanced Camera Positional Encoding (DeCaPE)
To leverage depth information from reference views, the model employs DeCaPE, an augmented positional encoding that incorporates 3D positions of reference image features. This encoding improves the cross-attention mechanism between target and reference features, enhancing the geometric consistency across views.
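One way to picture the reference-side encoding is to lift each reference pixel to a 3D point using its depth and then encode that point. The sketch below is an assumption-laden illustration, not the formulation from the paper: it assumes a pinhole camera with intrinsics `K`, z-depth maps, a 4x4 camera-to-world matrix, and a plain sinusoidal encoding; the function names are hypothetical.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) z-depth map to (H, W, 3) world-space points."""
    h, w = depth.shape
    # Pixel-center coordinates in homogeneous form.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3)
    rays_cam = pixels @ np.linalg.inv(K).T                 # z component is 1
    points_cam = rays_cam * depth[..., None]               # scale rays by depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t

def sinusoidal_encoding(points, num_freqs=4):
    """Encode (..., 3) points as (..., 3 * 2 * num_freqs) sin/cos features."""
    freqs = 2.0 ** np.arange(num_freqs)                    # (F,)
    angles = points[..., None] * freqs                     # (..., 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*points.shape[:-1], -1)

# Toy reference view: 4x4 depth map, camera shifted 1 unit along x.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
cam_to_world = np.eye(4)
cam_to_world[0, 3] = 1.0
points = unproject_depth(np.full((4, 4), 2.0), K, cam_to_world)
encoding = sinusoidal_encoding(points)
```

Because the encoded positions live in a shared world frame, reference features that correspond to the same surface point receive the same encoding regardless of which camera observed them, which is the intuition behind using depth to sharpen cross-view attention.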
Results
The method is evaluated on the 3D-FRONT dataset. Quantitative metrics for image quality (FID, IS) and depth accuracy (AbsRel, δi) show that HouseCrafter outperforms baseline methods such as CC3D and Text2Room. The ablation studies underline the importance of depth conditioning and floorplan guidance: consistency and visual fidelity improve significantly when these components are included.
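For readers unfamiliar with the depth metrics, the sketch below uses their standard definitions: AbsRel is the mean of |d − d*| / d*, and δi is the fraction of pixels whose ratio max(d/d*, d*/d) falls below 1.25^i. How the paper obtains the reference depth d* is not stated in this summary, so the function simply assumes a reference depth map is available; `depth_metrics` is a hypothetical helper name.

```python
import numpy as np

def depth_metrics(pred, ref, eps=1e-8):
    """AbsRel and threshold accuracies (delta_1..3) over valid pixels."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    valid = ref > eps                     # ignore pixels without reference depth
    pred, ref = pred[valid], ref[valid]
    abs_rel = float(np.mean(np.abs(pred - ref) / ref))
    # Symmetric ratio; clamp pred to avoid division by zero.
    ratio = np.maximum(pred / ref, ref / np.maximum(pred, eps))
    metrics = {"abs_rel": abs_rel}
    for i in (1, 2, 3):
        metrics[f"delta_{i}"] = float(np.mean(ratio < 1.25 ** i))
    return metrics

metrics = depth_metrics([1.0, 1.2, 3.0], [1.0, 1.0, 2.0])
```

Lower AbsRel is better, while higher δi is better; δ1 is the strictest of the three thresholds.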
User Study and Layout Compliance
An extensive user study further corroborates the quantitative results, indicating a strong preference for HouseCrafter's outputs in terms of visual appeal and alignment with the given floorplans. Additionally, layout compliance metrics computed with ODIN confirm that HouseCrafter's generated scenes adhere more closely to the input floorplan configuration, with mAP scores significantly higher than those of the baselines.
Implications and Future Directions
The work has both practical and theoretical implications. On a practical level, it offers a scalable and efficient tool for generating detailed 3D indoor scenes, which can significantly reduce manual effort in industries such as architecture, interior design, and real estate visualization. Theoretically, it demonstrates the potential of combining 2D generative models with floorplan guidance to overcome the challenges associated with scarce 3D data.
Future research could explore:
- Enhanced 3D Reconstruction Techniques: Developing reconstruction methods that can model view-dependent colors to improve the realism of the textured meshes.
- Optimized Pose Sampling: Designing more efficient pose sampling strategies that balance consistency and computational efficiency.
- Instance-aware Generation: Integrating instance-level information to further improve fidelity to the input floorplans.
Overall, HouseCrafter is a notable advance towards automated, scalable, and high-fidelity 3D scene generation from 2D layouts, with clear avenues for practical applications and further research.