Overview of Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields
The paper presents Text2NeRF, a method for generating complex 3D scenes from textual descriptions by combining Neural Radiance Fields (NeRF) with text-to-image diffusion models. It addresses a key limitation of existing text-to-3D methods, which often produce unrealistic objects with simple geometry, by introducing a framework that creates photo-realistic scenes with intricate geometries and textures.
Methodology
The method integrates NeRF with diffusion models to generate 3D scenes that align with a given text prompt. NeRF is employed for its robust representation of complex scenes, capturing both geometric and textural details. The approach uses a pre-trained text-to-image diffusion model to create an initial reference image and a monocular depth estimation model to provide geometric priors. These priors guide the NeRF optimization, enabling detailed scene synthesis without any additional 3D training data.
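To make the pipeline concrete, the following is a minimal sketch of how such image and depth priors could be obtained, assuming Stable Diffusion (via the diffusers library) as the text-to-image model and MiDaS (via torch.hub) as the monocular depth estimator; the paper's exact model choices and preprocessing may differ.

```python
# Minimal sketch: obtain the initial reference image and a depth prior for NeRF
# optimization. Assumes Stable Diffusion (diffusers) for text-to-image and
# MiDaS (torch.hub) for monocular depth; not necessarily the authors' exact models.
import torch
import numpy as np
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Generate the initial reference image from the text prompt.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
prompt = "a cozy living room with a fireplace and wooden furniture"  # example prompt
image = pipe(prompt).images[0]  # PIL image, e.g. 512x512

# 2. Estimate a depth map for the generated image as a geometric prior.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").to(device).eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform
img_np = np.array(image)  # RGB array expected by the MiDaS transform
with torch.no_grad():
    prediction = midas(transform(img_np).to(device))
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img_np.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()  # (H, W) relative inverse depth

# `image` and `depth` then act as the appearance and geometric priors that
# supervise NeRF optimization at the reference view.
```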
The authors introduce a progressive scene inpainting and updating strategy to ensure geometric and textural consistency across different views. This strategy overcomes the issue of overfitting in single-view NeRF training by utilizing a support set constructed via depth image-based rendering (DIBR). Additionally, a two-stage depth alignment is employed to mitigate depth inconsistencies across views, which improves the geometric stability and fidelity of the generated scenes.
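The first stage of such a depth alignment can be illustrated as a least-squares scale-and-shift fit between a newly estimated depth map and the depth rendered from the already-reconstructed scene over their overlapping pixels. The sketch below is an illustrative reconstruction under that assumption, not the authors' exact procedure; the second, local refinement stage is only noted in the comments.

```python
# Sketch of a global (stage-one) depth alignment: fit a scale and shift that
# maps a newly estimated depth map onto the depth rendered from the current
# scene over their overlapping (valid) pixels. The paper's second, local
# refinement stage is not reproduced here.
import torch

def global_depth_alignment(d_new: torch.Tensor,
                           d_ref: torch.Tensor,
                           mask: torch.Tensor) -> torch.Tensor:
    """Align `d_new` to `d_ref` with a least-squares scale/shift on `mask`."""
    x = d_new[mask]                                    # estimated depth at overlap
    y = d_ref[mask]                                    # rendered (reference) depth
    A = torch.stack([x, torch.ones_like(x)], dim=1)    # [N, 2] design matrix
    # Solve min || A @ [s, t]^T - y ||^2 for scale s and shift t.
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution.squeeze(1)
    s, t = sol[0], sol[1]
    return s * d_new + t                               # globally aligned depth map

# Usage (hypothetical names): after inpainting a novel view and estimating its
# depth, align that depth to the depth rendered from the current NeRF before
# adding the view to the DIBR support set.
# d_aligned = global_depth_alignment(d_estimated, d_rendered, valid_mask)
```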
Experimental Results
The authors conducted extensive experiments demonstrating that Text2NeRF outperforms existing methods in generating high-quality, multi-view consistent, and diverse 3D scenes from a variety of text prompts. Quantitative results show improvements over baseline methods on no-reference image quality metrics such as BRISQUE and NIQE, as well as stronger text-image semantic alignment, indicated by higher CLIP similarity scores.
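For reference, the CLIP similarity metric can be computed as the cosine similarity between CLIP embeddings of the text prompt and of images rendered from the generated scene. The snippet below uses the Hugging Face transformers CLIP implementation as one possible choice; the paper's exact CLIP variant and evaluation protocol may differ.

```python
# Sketch of the text-image alignment metric: CLIP similarity between the text
# prompt and a rendered view of the generated scene.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(prompt: str, image: Image.Image) -> float:
    """Cosine similarity between CLIP text and image embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum(dim=-1))

# Averaging clip_similarity(prompt, rendered_view) over many rendered views
# gives a scene-level score of how well the result matches the text prompt.
```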
Implications and Future Directions
Text2NeRF represents a significant step toward bridging the gap between text and 3D content generation. Its capacity to produce detailed and realistic scenes offers valuable applications in fields such as gaming, virtual reality, and digital content creation, where demand for high-quality 3D content continues to grow. The paper also underscores that integrating both low-level detail and high-level semantic content is important for realistic scene synthesis.
Future work may focus on addressing limitations such as inaccuracies in depth estimation and the high computational cost of NeRF optimization. Extending the framework to handle broader categories of scenes or incorporating multi-modal inputs could further enhance the versatility and applicability of text-driven 3D generation. Additionally, advances in diffusion models and neural rendering could further improve the fidelity and realism of synthesized scenes, pushing the boundaries of what is feasible with text-to-3D systems.