CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout (2303.13843v5)
Abstract: Text-to-3D generation plays a crucial role in creating editable 3D scenes for AR/VR. Recent advances have shown promise in merging neural radiance fields (NeRFs) with pre-trained diffusion models for text-to-3D object generation. However, one enduring challenge is their inadequate capability to accurately parse and regenerate consistent multi-object environments. Specifically, these models have difficulty representing the quantity and style specified by multi-object prompts, often collapsing in rendering fidelity and failing to match the semantic intricacies of the text. Moreover, amalgamating these elements into a coherent 3D scene is a substantial challenge, stemming from the generic distributions inherent in diffusion models. To tackle this 'guidance collapse' and further enhance scene consistency, we propose a novel framework, dubbed CompoNeRF, which integrates an editable 3D scene layout with object-specific and scene-wide guidance mechanisms. It first interprets a complex text prompt into a layout populated with multiple NeRFs, each paired with a corresponding subtext prompt for precise object depiction. Next, a tailored composition module seamlessly blends these NeRFs to promote consistency, while dual-level text guidance reduces ambiguity and boosts accuracy. Notably, our composition design also permits decomposition, enabling flexible scene editing and recomposition into new scenes from an edited layout or text prompt. Using the open-source Stable Diffusion model, CompoNeRF generates multi-object scenes with high fidelity. Remarkably, our framework achieves up to a 54% improvement on the multi-view CLIP score metric. Our user study indicates that our method significantly improves semantic accuracy, multi-view consistency, and individual recognizability in multi-object scene generation.
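The composition of multiple object NeRFs into a single scene can be sketched as follows. This is a minimal illustration, not the paper's actual composition module: it assumes density-weighted color blending at shared ray samples (a common convention in compositional radiance fields), followed by standard volume rendering. All function and variable names here are illustrative.

```python
import numpy as np

def composite_fields(sigmas, colors, eps=1e-8):
    """Blend K object fields queried at the same N ray samples.

    sigmas: (K, N) per-object volume densities
    colors: (K, N, 3) per-object RGB values
    Returns the composite density (N,) and density-weighted color (N, 3).
    """
    sigma = sigmas.sum(axis=0)                      # densities add
    w = sigmas / (sigma[None, :] + eps)             # per-object blend weights
    color = (w[..., None] * colors).sum(axis=0)     # weighted RGB mix
    return sigma, color

def volume_render(sigma, color, deltas):
    """Standard NeRF quadrature along one ray.

    sigma: (N,) composite density, color: (N, 3), deltas: (N,) sample spacings.
    """
    alpha = 1.0 - np.exp(-sigma * deltas)           # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha                         # transmittance * opacity
    return (weights[:, None] * color).sum(axis=0)   # rendered RGB

# Example: an object with zero density contributes nothing to the blend.
sigmas = np.array([[2.0, 2.0], [0.0, 0.0]])        # object 2 is empty here
colors = np.array([[[1.0, 0.0, 0.0]] * 2,          # object 1: red
                   [[0.0, 1.0, 0.0]] * 2])         # object 2: green
sigma, color = composite_fields(sigmas, colors)
rgb = volume_render(sigma, color, deltas=np.array([0.5, 0.5]))
```

Under this convention, editing the layout reduces to translating or removing one object's samples before compositing, which is what makes per-object decomposition and recomposition straightforward.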
- Haotian Bai
- Yuanhuiyi Lyu
- Lutao Jiang
- Sijia Li
- Haonan Lu
- Xiaodong Lin
- Lin Wang