"CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout" addresses the challenges faced by current neural radiance fields (NeRFs) and diffusion models in generating complex 3D scenes based on textual descriptions. Despite the progress made in text-to-3D object generation, existing models struggle with accurately parsing and rendering multi-object environments, often failing to maintain consistency and fidelity when translating intricate semantic details from text prompts into 3D scenes.
To overcome these limitations, CompoNeRF introduces a framework that improves both the quality and the flexibility of 3D scene generation by combining an editable 3D scene layout with dual-level text guidance. Its key innovations are as follows:
- Editable 3D Layout: The framework first interprets a complex text prompt into an editable 3D layout. The layout is populated with multiple NeRFs, each paired with a specific subtext prompt so that individual objects in the scene are depicted precisely and in detail (see the layout sketch after this list).
- Object-specific and Scene-wide Guidance Mechanisms: CompoNeRF employs dual-level text guidance to mitigate ambiguity and improve accuracy, ensuring that both individual objects and the overall scene composition remain consistent with the provided textual description (a loss sketch follows this list).
- Composition Module: A tailored composition module integrates the individual NeRFs into a coherent 3D scene and is critical to keeping the final rendered output visually and semantically consistent (a compositing sketch follows this list).
- NeRF Decomposition: The modular design of CompoNeRF allows the scene to be decomposed back into its constituent NeRFs, enabling flexible editing. Users can modify specific elements of the scene layout or adjust the text prompts to generate new compositions without starting from scratch, as the layout example below illustrates.
- Performance Improvements: Leveraging the Stable Diffusion model for guidance, CompoNeRF achieves up to a 54% increase in the multi-view CLIP score, a metric that measures how well 3D scene renderings align with their text descriptions (a scoring sketch follows this list).
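The editable layout can be pictured as a small data structure: a global scene prompt plus one box per object, where each box stores its spatial placement and the subtext prompt that guides its NeRF. The sketch below is illustrative only; the class and field names (`SceneLayout`, `LayoutBox`, and the example prompts) are hypothetical and not taken from the CompoNeRF codebase.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LayoutBox:
    """One object slot in the editable 3D layout (hypothetical structure)."""
    prompt: str                         # subtext prompt guiding this object's NeRF
    center: Tuple[float, float, float]  # box centre in scene coordinates
    size: Tuple[float, float, float]    # box extent (width, height, depth)
    yaw: float = 0.0                    # rotation about the vertical axis, radians

@dataclass
class SceneLayout:
    """Editable 3D layout: a global scene prompt plus one box (and NeRF) per object."""
    scene_prompt: str
    boxes: List[LayoutBox] = field(default_factory=list)

# Example: a two-object scene decomposed into per-object slots.
layout = SceneLayout(
    scene_prompt="an apple on a wooden table",
    boxes=[
        LayoutBox(prompt="a wooden table", center=(0.0, 0.0, 0.0), size=(1.0, 0.1, 0.6)),
        LayoutBox(prompt="a red apple", center=(0.0, 0.15, 0.0), size=(0.12, 0.12, 0.12)),
    ],
)

# Editing the scene amounts to changing the layout rather than retraining from scratch:
layout.boxes[1].center = (0.3, 0.15, 0.0)   # move the apple
layout.boxes[1].prompt = "a green pear"     # or swap the subtext prompt entirely
```

Because each box owns its NeRF and its subtext prompt, decomposing a trained scene back into these slots is what makes the recomposition and editing described above possible.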
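One way to read the dual-level guidance is as two score-distillation terms: each per-object render is supervised with its subtext prompt, and the composed scene render is supervised with the full scene prompt. The function below is a hedged sketch of that combination; `sds_loss_fn`, `render_object`, and `render_scene` are assumed callables standing in for the diffusion-based loss and the renderers, not actual CompoNeRF APIs.

```python
def dual_level_guidance_loss(sds_loss_fn, render_object, render_scene,
                             object_prompts, scene_prompt,
                             lambda_local=1.0, lambda_global=1.0):
    """Combine object-level and scene-level text guidance (illustrative only).

    sds_loss_fn(image, prompt) -> scalar loss from a pretrained text-to-image
    diffusion model (e.g. Stable Diffusion) via score distillation;
    render_object(i) renders the i-th NeRF inside its layout box;
    render_scene() renders the full composed scene.
    """
    # Local guidance: every object must match its own subtext prompt.
    local = sum(sds_loss_fn(render_object(i), p)
                for i, p in enumerate(object_prompts))
    # Global guidance: the composed scene must match the full scene prompt.
    global_term = sds_loss_fn(render_scene(), scene_prompt)
    return lambda_local * local + lambda_global * global_term
```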
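The composition step can be sketched as querying each object's NeRF only inside its layout box (in that box's local frame) and merging the results into one global radiance field, for instance by summing densities and taking a density-weighted colour blend before standard volume rendering. The code below assumes hypothetical `box.contains`, `box.to_local`, and `nerf.query` helpers and is not the paper's exact module.

```python
import torch

def compose_fields(nerfs, boxes, points_world):
    """Merge per-object NeRFs into one global density/colour field (illustrative)."""
    n_pts = points_world.shape[0]
    sigma = torch.zeros(n_pts)            # composed density
    rgb = torch.zeros(n_pts, 3)           # density-weighted colour accumulator
    for nerf, box in zip(nerfs, boxes):
        mask = box.contains(points_world)             # samples falling inside this box
        if mask.any():
            local_pts = box.to_local(points_world[mask])
            s, c = nerf.query(local_pts)              # per-object density and colour
            sigma[mask] = sigma[mask] + s
            rgb[mask] = rgb[mask] + s.unsqueeze(-1) * c
    rgb = rgb / sigma.clamp(min=1e-6).unsqueeze(-1)   # normalise the colour blend
    return sigma, rgb    # fed into the usual volume-rendering integral
```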
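The multi-view CLIP score used in the reported comparison can be understood as the average cosine similarity between the CLIP embedding of the scene prompt and CLIP embeddings of renders taken from several camera viewpoints. The snippet below assumes the embeddings have already been produced by a CLIP encoder (for example via open_clip) and only shows the scoring arithmetic.

```python
import torch.nn.functional as F

def multiview_clip_score(image_embeds, text_embed):
    """Mean cosine similarity between V rendered views and the text prompt.

    image_embeds: (V, D) CLIP image embeddings, one per rendered viewpoint.
    text_embed:   (D,)   CLIP text embedding of the scene prompt.
    """
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)
    return (image_embeds @ text_embed).mean()
```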
Overall, CompoNeRF not only improves the fidelity and consistency of generated 3D scenes but also offers a more flexible, editable approach to multi-object 3D composition, paving the way for applications that require accurate and editable 3D scene generation from text. The code for CompoNeRF is publicly available, fostering further research and development in this area.