NeRFiller: Completing Scenes via Generative 3D Inpainting (2312.04560v1)

Published 7 Dec 2023 in cs.CV, cs.AI, and cs.GR

Abstract: We propose NeRFiller, an approach that completes missing portions of a 3D capture via generative 3D inpainting using off-the-shelf 2D visual generative models. Often parts of a captured 3D scene or object are missing due to mesh reconstruction failures or a lack of observations (e.g., contact regions, such as the bottom of objects, or hard-to-reach areas). We approach this challenging 3D inpainting problem by leveraging a 2D inpainting diffusion model. We identify a surprising behavior of these models, where they generate more 3D consistent inpaints when images form a 2$\times$2 grid, and show how to generalize this behavior to more than four images. We then present an iterative framework to distill these inpainted regions into a single consistent 3D scene. In contrast to related works, we focus on completing scenes rather than deleting foreground objects, and our approach does not require tight 2D object masks or text. We compare our approach to relevant baselines adapted to our setting on a variety of scenes, where NeRFiller creates the most 3D consistent and plausible scene completions. Our project page is at https://ethanweber.me/nerfiller.

Summary

  • The paper introduces a method that completes missing parts of a 3D capture by tiling multiple views into a grid so that an off-the-shelf 2D inpainting diffusion model produces multi-view-consistent inpaints.
  • It demonstrates more 3D-consistent and plausible scene completions than baseline inpainting methods adapted to this setting.
  • The approach requires neither tight object masks nor text prompts, and completions can be steered with reference images.

Overview of Generative 3D Inpainting

The emergence of 3D scene capture technology has accelerated the creation of immersive worlds, but captures often suffer from incomplete data due to occlusions or missing observations. Bridging these gaps in 3D environments is crucial for applications ranging from virtual reality to film production. NeRFiller addresses this challenge with a generative 3D inpainting strategy that uses off-the-shelf 2D image inpainting models to complete three-dimensional scenes.

The Shortcomings in Capturing Complete 3D Scenes

3D scanning, while sophisticated, frequently yields scenes with unobserved regions or undesired elements. Editing these captures to fill in or modify content requires consistency across multiple views, a task that is difficult for models built for 2D image generation, which lack inherent 3D understanding.

NeRFiller's Innovative Approach

NeRFiller leverages 2D inpainting diffusion models, exploiting the observation that they produce more 3D-consistent inpaints when four images are tiled into a 2×2 grid (see the sketch below). A Joint Multi-View Inpainting technique generalizes this behavior to more than four images while retaining multi-view consistency. In an iterative process, these 2D inpaints are then distilled into a cohesive 3D scene representation, resulting in plausible and 3D-consistent scene completions.
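To make the grid behavior concrete, here is a minimal sketch that tiles four rendered views (and their inpainting masks) into a 2×2 grid, runs a single pass of an off-the-shelf Stable Diffusion inpainting pipeline over the grid, and splits the result back into per-view inpaints. The helpers tile_2x2 and split_2x2, the file names, and the choice of pipeline are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch: inpaint four views jointly by tiling them into one 2x2 grid image,
# which tends to yield more 3D-consistent results than inpainting each view
# separately. Pipeline choice and helper names are assumptions for illustration.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

def tile_2x2(images):
    """Paste four same-sized PIL images into one 2x2 grid image."""
    w, h = images[0].size
    grid = Image.new(images[0].mode, (2 * w, 2 * h))
    for i, im in enumerate(images):
        grid.paste(im, ((i % 2) * w, (i // 2) * h))
    return grid

def split_2x2(grid, w, h):
    """Inverse of tile_2x2: crop the grid back into four per-view images."""
    return [grid.crop(((i % 2) * w, (i // 2) * h,
                       (i % 2 + 1) * w, (i // 2 + 1) * h))
            for i in range(4)]

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# Hypothetical inputs: four renders of the scene and masks marking missing regions.
views = [Image.open(f"render_{i}.png").convert("RGB").resize((256, 256)) for i in range(4)]
masks = [Image.open(f"mask_{i}.png").convert("L").resize((256, 256)) for i in range(4)]

grid_img, grid_mask = tile_2x2(views), tile_2x2(masks)
result = pipe(prompt="", image=grid_img, mask_image=grid_mask,
              height=512, width=512).images[0]          # no text prompt needed
inpainted_views = split_2x2(result, 256, 256)            # per-view inpaints, now more mutually consistent
```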

The method does not require tight object masks or text prompts, relying on scene context alone. It stands apart from baselines that either generate new scenes from scratch or remove objects, offering a targeted remedy for captures with partial data.
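The iterative framework mentioned above, which alternates 2D inpainting of the current renders with 3D optimization, can be sketched as a simple loop. The callables render_views, inpaint_views, and train_scene are placeholders for whatever renderer, 2D inpainter, and reconstruction trainer (for example, a NeRF trained in Nerfstudio) are plugged in; this is an outline of the idea, not the paper's exact training schedule.

```python
# Sketch of the inpaint-and-distill loop: repeatedly refresh the masked regions
# of the current renders with a 2D inpainter, then fit the 3D scene to the
# updated images so it converges toward one consistent completion.
from typing import Callable, List, Sequence

def iterative_inpaint_distill(
    scene,                                  # mutable 3D scene representation (e.g., a NeRF)
    cameras: Sequence,                      # camera poses to render and inpaint from
    masks: List,                            # per-view masks of the missing regions
    render_views: Callable,                 # (scene, cameras) -> list of rendered images
    inpaint_views: Callable,                # (images, masks) -> list of inpainted images
    train_scene: Callable,                  # (scene, images, cameras, steps) -> updated scene
    rounds: int = 10,                       # illustrative values, not the paper's settings
    steps_per_round: int = 2000,
):
    """Alternate 2D inpainting of current renders with 3D optimization."""
    for _ in range(rounds):
        renders = render_views(scene, cameras)                           # current, partially filled views
        targets = inpaint_views(renders, masks)                          # refresh masked regions in 2D
        scene = train_scene(scene, targets, cameras, steps_per_round)    # distill the inpaints into 3D
    return scene
```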

Implementation and Results

NeRFiller's effectiveness is demonstrated through comparisons with existing techniques across a variety of scenes, where it completes scenes more coherently and plausibly than the adapted baselines. NeRFiller also offers user control over the inpainting process: a reference image can be supplied to guide the completion.

Limitations and Future Directions

Despite substantial progress, NeRFiller struggles to produce high-resolution detail in regions far from observed viewpoints, and applying the method to casual captures remains difficult because their mask patterns are out of distribution for existing 2D inpainting models. These areas present opportunities for future work.

Conclusion

NeRFiller takes a significant step in 3D content generation. By providing a scene-completion method conditioned on multi-view images, it opens new possibilities for refining 3D captures, paving the way toward more seamless and detailed virtual environments.