NeRFiller: Completing Scenes via Generative 3D Inpainting (2312.04560v1)

Published 7 Dec 2023 in cs.CV, cs.AI, and cs.GR

Abstract: We propose NeRFiller, an approach that completes missing portions of a 3D capture via generative 3D inpainting using off-the-shelf 2D visual generative models. Often parts of a captured 3D scene or object are missing due to mesh reconstruction failures or a lack of observations (e.g., contact regions, such as the bottom of objects, or hard-to-reach areas). We approach this challenging 3D inpainting problem by leveraging a 2D inpainting diffusion model. We identify a surprising behavior of these models, where they generate more 3D consistent inpaints when images form a 2$\times$2 grid, and show how to generalize this behavior to more than four images. We then present an iterative framework to distill these inpainted regions into a single consistent 3D scene. In contrast to related works, we focus on completing scenes rather than deleting foreground objects, and our approach does not require tight 2D object masks or text. We compare our approach to relevant baselines adapted to our setting on a variety of scenes, where NeRFiller creates the most 3D consistent and plausible scene completions. Our project page is at https://ethanweber.me/nerfiller.

Summary

  • The paper introduces a method that completes missing parts of a 3D capture by tiling multiple views into a grid so that an off-the-shelf 2D inpainting diffusion model produces multi-view-consistent inpaints.
  • It demonstrates more 3D-consistent and plausible scene completions than baseline inpainting methods adapted to this setting.
  • The approach requires neither tight object masks nor text prompts, and completions can be steered with reference images.

Overview of Generative 3D Inpainting

The emergence of 3D scene capture technology has accelerated the creation of immersive worlds, but captures often suffer from incomplete data due to occlusions or missing observations. Bridging these gaps in 3D environments is crucial for applications ranging from virtual reality to film production. NeRFiller addresses this challenge with a generative 3D inpainting strategy that uses off-the-shelf 2D image inpainting models to complete three-dimensional scenes.

The Shortcomings in Capturing Complete 3D Scenes

3D scanning, while sophisticated, frequently yields scenes with unobserved regions or undesired elements. Editing these captures to fill in or modify content requires consistency across multiple views, a task that is difficult for models built for 2D image generation, which lack inherent 3D understanding.

NeRFiller's Innovative Approach

NeRFiller leverages 2D inpainting diffusion models, exploiting the observation that they produce more 3D-consistent inpaints when four images are tiled into a 2×2 grid (see the sketch below). A Joint Multi-View Inpainting technique generalizes this behavior to more than four images while retaining multi-view consistency. In an iterative process, these 2D inpaints are then distilled into a cohesive 3D scene representation, resulting in plausible and 3D-consistent scene completions.
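To make the grid behavior concrete, here is a minimal sketch that tiles four rendered views (and their inpainting masks) into a 2×2 grid, runs a single pass of an off-the-shelf Stable Diffusion inpainting pipeline over the grid, and splits the result back into per-view inpaints. The helpers tile_2x2 and split_2x2, the file names, and the choice of pipeline are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch: inpaint four views jointly by tiling them into one 2x2 grid image,
# which tends to yield more 3D-consistent results than inpainting each view
# separately. Pipeline choice and helper names are assumptions for illustration.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

def tile_2x2(images):
    """Paste four same-sized PIL images into one 2x2 grid image."""
    w, h = images[0].size
    grid = Image.new(images[0].mode, (2 * w, 2 * h))
    for i, im in enumerate(images):
        grid.paste(im, ((i % 2) * w, (i // 2) * h))
    return grid

def split_2x2(grid, w, h):
    """Inverse of tile_2x2: crop the grid back into four per-view images."""
    return [grid.crop(((i % 2) * w, (i // 2) * h,
                       (i % 2 + 1) * w, (i // 2 + 1) * h))
            for i in range(4)]

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# Hypothetical inputs: four renders of the scene and masks marking missing regions.
views = [Image.open(f"render_{i}.png").convert("RGB").resize((256, 256)) for i in range(4)]
masks = [Image.open(f"mask_{i}.png").convert("L").resize((256, 256)) for i in range(4)]

grid_img, grid_mask = tile_2x2(views), tile_2x2(masks)
result = pipe(prompt="", image=grid_img, mask_image=grid_mask,
              height=512, width=512).images[0]          # no text prompt needed
inpainted_views = split_2x2(result, 256, 256)            # per-view inpaints, now more mutually consistent
```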

The method does not require tight object masks or text prompts, relying on scene context alone. It stands apart from baselines that either generate new scenes from scratch or remove objects, offering a targeted remedy for captures with partial data.
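The iterative framework mentioned above, which alternates 2D inpainting of the current renders with 3D optimization, can be sketched as a simple loop. The callables render_views, inpaint_views, and train_scene are placeholders for whatever renderer, 2D inpainter, and reconstruction trainer (for example, a NeRF trained in Nerfstudio) are plugged in; this is an outline of the idea, not the paper's exact training schedule.

```python
# Sketch of the inpaint-and-distill loop: repeatedly refresh the masked regions
# of the current renders with a 2D inpainter, then fit the 3D scene to the
# updated images so it converges toward one consistent completion.
from typing import Callable, List, Sequence

def iterative_inpaint_distill(
    scene,                                  # mutable 3D scene representation (e.g., a NeRF)
    cameras: Sequence,                      # camera poses to render and inpaint from
    masks: List,                            # per-view masks of the missing regions
    render_views: Callable,                 # (scene, cameras) -> list of rendered images
    inpaint_views: Callable,                # (images, masks) -> list of inpainted images
    train_scene: Callable,                  # (scene, images, cameras, steps) -> updated scene
    rounds: int = 10,                       # illustrative values, not the paper's settings
    steps_per_round: int = 2000,
):
    """Alternate 2D inpainting of current renders with 3D optimization."""
    for _ in range(rounds):
        renders = render_views(scene, cameras)                           # current, partially filled views
        targets = inpaint_views(renders, masks)                          # refresh masked regions in 2D
        scene = train_scene(scene, targets, cameras, steps_per_round)    # distill the inpaints into 3D
    return scene
```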

Implementation and Results

NeRFiller's effectiveness is demonstrated through comparisons with existing techniques across a variety of scenes, where it completes scenes more coherently and plausibly than the adapted baselines. NeRFiller also offers user control over the inpainting process: a reference image can be supplied to guide the completion.

Limitations and Future Directions

Despite substantial progress, NeRFiller struggles to produce high-resolution detail in regions far from observed viewpoints, and applying the method to casual captures remains difficult because their mask patterns are out of distribution for existing 2D inpainting models. These areas present opportunities for future work.

Conclusion

NeRFiller takes a significant step in 3D content generation. By providing a scene-completion method conditioned on multi-view images, it opens new possibilities for refining 3D captures, paving the way toward more seamless and detailed virtual environments.