SIGNeRF: Scene Integrated Generation for Neural Radiance Fields (2401.01647v2)

Published 3 Jan 2024 in cs.CV and cs.GR

Abstract: Advances in image diffusion models have recently led to notable improvements in the generation of high-quality images. In combination with Neural Radiance Fields (NeRFs), they enabled new opportunities in 3D generation. However, most generative 3D approaches are object-centric and applying them to editing existing photorealistic scenes is not trivial. We propose SIGNeRF, a novel approach for fast and controllable NeRF scene editing and scene-integrated object generation. A new generative update strategy ensures 3D consistency across the edited images, without requiring iterative optimization. We find that depth-conditioned diffusion models inherently possess the capability to generate 3D consistent views by requesting a grid of images instead of single views. Based on these insights, we introduce a multi-view reference sheet of modified images. Our method updates an image collection consistently based on the reference sheet and refines the original NeRF with the newly generated image set in one go. By exploiting the depth conditioning mechanism of the image diffusion model, we gain fine control over the spatial location of the edit and enforce shape guidance by a selected region or an external mesh.

References (64)
  1. AUTOMATIC1111. Stable diffusion webui. https://github.com/AUTOMATIC1111/stable-diffusion-webui, 2022.
  2. Sine: Semantic-driven image-based nerf editing with prior-guided editing field. pages 20919–20929, 2023.
  3. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. pages 5470–5479, 2021.
  4. Instructpix2pix: Learning to follow image editing instructions. pages 18392–18402, 2022.
  5. Ricardo Cabello. Three.js, 2010.
  6. Text2shape: Generating shapes from natural language by learning joint embeddings. pages 100–116, 2019.
  7. Tango: Text-driven photorealistic and robust 3d stylization via lighting decomposition. 2022.
  8. Set-the-scene: Global-local training for generating controllable nerf scenes. arXiv preprint arXiv:2303.13450, 2023.
  9. Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors. pages 20637–20647, 2022.
  10. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2021.
  11. Get3d: A generative model of high quality 3d textured shapes learned from images. Advances In Neural Information Processing Systems, 35:31841–31854, 2022.
  12. Blended-nerf: Zero-shot object generation and blending in existing neural radiance fields. arXiv preprint arXiv:2306.12760, 2023.
  13. Instruct-nerf2nerf: Editing 3d scenes with instructions. 2023.
  14. Chris Heinrich. Polycam: LiDAR scanning app for iPhone. https://poly.cam/, 2023.
  15. Paul Henschel. React three fiber, 2019.
  16. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  17. Lora: Low-rank adaptation of large language models. 2022.
  18. Dreamtime: An improved optimization strategy for text-to-3d content creation. arXiv preprint arXiv:2306.12422, 2023.
  19. Zero-shot text-guided object generation with dream fields. pages 867–876, 2022.
  20. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 406–413, 2014.
  21. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.
  22. Clip-mesh: Generating textured meshes from text using pretrained image-text models. 2022.
  23. Control-nerf: Editable feature volumes for scene rendering and manipulation. pages 4340–4350, 2022.
  24. Diffusion-sdf: Text-to-shape via voxelized diffusion. pages 12642–12651, 2022.
  25. Interactive geometry editing of neural radiance fields. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 6(1), 2023.
  26. Magic3d: High-resolution text-to-3d content creation. 2022.
  27. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization, 2023a.
  28. Zero-1-to-3: Zero-shot one image to 3d object. arXiv preprint arXiv:2303.11328, 2023b.
  29. Meta. React, 2013.
  30. Latent-nerf for shape-guided generation of 3d shapes and textures. pages 12663–12673, 2023.
  31. Text2mesh: Text-driven neural stylization for meshes. pages 13492–13502, 2021.
  32. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2020.
  33. Diffrf: Rendering-guided 3d radiance field diffusion. pages 4328–4338, 2022a.
  34. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics, 41(4):1–15, 2022b.
  35. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  36. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  37. Compositional 3d scene generation using locally conditioned diffusion. arXiv preprint arXiv:2303.12218, 2023.
  38. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  39. Dreamfusion: Text-to-3d using 2d diffusion. 2023.
  40. Learning transferable visual models from natural language supervision. pages 8748–8763, 2021.
  41. High-resolution image synthesis with latent diffusion models. pages 10684–10695, 2021.
  42. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. pages 22500–22510, 2022.
  43. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  44. Clip-forge: Towards zero-shot text-to-shape generation. pages 18603–18613, 2021.
  45. Structure-from-Motion Revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  46. Laion-5b: An open large-scale dataset for training next generation image-text models. 36, 2022.
  47. Vox-e: Text-guided voxel editing of 3d objects. arXiv preprint arXiv:2303.12048, 2023.
  48. Controlnetinpaint: Inpaint images with controlnet. https://github.com/mikonvergence/ControlNetInpaint, 2023. GitHub repository.
  49. Deep unsupervised learning using nonequilibrium thermodynamics. pages 2256–2265, 2015.
  50. Nerfstudio: A modular framework for neural radiance field development. arXiv preprint arXiv:2302.04264, 2023.
  51. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. arXiv preprint arXiv:2303.14184, 2023.
  52. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
  53. Sketch-guided text-to-image diffusion models. arXiv preprint arXiv:2211.13752, 2022.
  54. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023.
  55. Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628, 2022.
  56. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360° views. pages 4479–4489, 2022.
  57. Deforming radiance fields with cages. pages 159–175, 2022.
  58. Neumesh: Learning disentangled neural mesh-based implicit field for geometry and texture editing. pages 597–614, 2022.
  59. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. pages 1790–1799, 2019.
  60. Nerf-editing: Geometry editing of neural radiance fields. pages 18353–18364, 2022.
  61. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
  62. The unreasonable effectiveness of deep features as a perceptual metric. pages 586–595, 2018.
  63. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. pages 12588–12597, 2022.
  64. Dreameditor: Text-driven 3d scene editing with neural fields, 2023.
Authors (3)
  1. Jan-Niklas Dihlmann (2 papers)
  2. Andreas Engelhardt (6 papers)
  3. Hendrik Lensch (4 papers)
Citations (3)

Summary

  • The paper introduces SIGNeRF, a method that integrates generative diffusion models with reference-sheet updates to refine NeRF scene edits in a single operation.
  • The paper demonstrates superior performance in object generation and editing, achieving higher fidelity as measured by CLIP directional similarity, PSNR, and SSIM, while also generating results faster than prior methods.
  • The paper offers a modular pipeline that enables fast, controllable, and previewable 3D scene editing, outperforming existing solutions in selection precision and scene preservation.

Introduction

This paper introduces SIGNeRF, a method for editing existing 3D scenes and integrating new objects with high fidelity using generative 2D diffusion models. Traditional approaches require complex pipelines and iterative optimization, and they offer little precise control over the result. SIGNeRF addresses these challenges with a reference-sheet-based update strategy that ensures 3D consistency across edits: a multi-view reference sheet of modified images guides a consistent update of the image collection, which then refines the original Neural Radiance Field (NeRF) scene in a single operation.
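
To make the reference-sheet idea concrete, the following is a minimal sketch (not the authors' code) of tiling several rendered views into a single grid image that a diffusion model can then edit as one picture; the 3x3 layout and the helper name are illustrative assumptions.

```python
# Minimal sketch, assuming equally sized uint8 renders: tile multi-view images
# into one "reference sheet" grid. The 3x3 layout is an illustrative choice.
import numpy as np
from PIL import Image

def make_reference_sheet(views, rows=3, cols=3):
    """Arrange rows*cols view renders (H, W, 3 arrays) into one grid image."""
    assert len(views) == rows * cols, "expected one render per grid cell"
    h, w, _ = views[0].shape
    sheet = np.zeros((rows * h, cols * w, 3), dtype=np.uint8)
    for i, view in enumerate(views):
        r, c = divmod(i, cols)
        sheet[r * h:(r + 1) * h, c * w:(c + 1) * w] = view
    return Image.fromarray(sheet)
```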

Background and Related Work

In terms of related work, the paper reviews advances in text-to-image and text-to-3D generation, in particular diffusion probabilistic models trained on large-scale datasets, which can generate high-resolution and diverse images. ControlNet, which adds conditioning inputs such as depth maps to an image diffusion model, turns out to be capable of producing coherent, mutually consistent views. The paper also discusses the challenges of editing NeRF scenes and the limited capabilities of current solutions. Finally, generative NeRF editing is considered, reflecting a growing interest in modifying existing NeRF scenes with generative models.
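
As background, the snippet below shows how a depth-conditioned ControlNet can be driven through the Hugging Face diffusers library; the model identifiers, prompt, and file names are illustrative assumptions, and this is not the SIGNeRF implementation itself.

```python
# Hedged sketch: depth-conditioned generation with ControlNet via diffusers.
# Model IDs, prompt, and file names are placeholders, not taken from the paper.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

depth_grid = Image.open("depth_grid.png")  # rendered depth of the image grid
edited = pipe("a bronze statue of a cat", image=depth_grid,
              num_inference_steps=30).images[0]  # depth keeps the geometry fixed
edited.save("reference_sheet_edit.png")
```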

Methodology

The proposed SIGNeRF pipeline comprises several stages, starting with training the original NeRF scene. After the 3D region to edit is selected, reference cameras are placed, and the corresponding color, depth, and mask images are rendered and arranged into image grids. ControlNet processes these grids to produce a reference sheet, which then guides a consistent update of all images in the NeRF dataset, preserving multi-view consistency. Spatial control is provided by two selection modes in scene space: a mesh proxy and a bounding-box selection. An optional second iteration can be performed if necessary, and because the pipeline is modular, individual components can be fine-tuned or exchanged easily.
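
The following is a schematic outline of this generative update, written under the assumptions stated in the comments; every helper passed in (render_rgb_depth_mask, edit_grid_with_controlnet, refine_nerf) is a hypothetical placeholder, not part of the released code.

```python
# Schematic sketch of the SIGNeRF-style update loop described above. All
# helpers are hypothetical placeholders injected as arguments.
def signerf_edit(nerf, dataset_cameras, reference_cameras, prompt,
                 render_rgb_depth_mask, edit_grid_with_controlnet, refine_nerf):
    # 1. Render color/depth/mask grids from the reference cameras and edit
    #    them jointly, so the diffusion model sees all reference views at once.
    rgb, depth, mask = render_rgb_depth_mask(nerf, reference_cameras)
    reference_sheet = edit_grid_with_controlnet(rgb, depth, mask, prompt)

    # 2. Update every dataset image together with the reference sheet so that
    #    each edited view stays consistent with the same reference.
    updated_images = []
    for cam in dataset_cameras:
        rgb, depth, mask = render_rgb_depth_mask(nerf, [cam])
        updated_images.append(
            edit_grid_with_controlnet(rgb, depth, mask, prompt,
                                      context=reference_sheet))

    # 3. Refine the original NeRF on the consistently edited image set in one go.
    return refine_nerf(nerf, dataset_cameras, updated_images)
```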

Outcomes and Comparison

The SIGNeRF pipeline yields superior results in object generation and editing, with a consistent style across all views. Compared to existing methods such as Instruct-NeRF2NeRF and DreamEditor, SIGNeRF shows improved scene preservation, selection precision, and generation quality. Notably, it enables more complex object edits and allows edits to be previewed before the complete updated image set is generated. It is also faster, requiring roughly half the time of competing methods. Quantitative evaluations using CLIP text-to-image directional similarity together with PSNR and SSIM indicate that SIGNeRF better preserves the unedited parts of the scene and achieves higher fidelity to the text prompts. The method has limitations, however, such as reduced edit quality for objects far from the camera and difficulties with off-center objects or extensive scene modifications.
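
To illustrate the main quantitative metric, the sketch below computes CLIP text-to-image directional similarity between an original and an edited view; the model identifier and the function structure are assumptions rather than the paper's evaluation code.

```python
# Hedged sketch of CLIP directional similarity: cosine similarity between the
# image-space edit direction and the text-space edit direction.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def directional_similarity(img_before, img_after, text_before, text_after):
    inputs = processor(text=[text_before, text_after],
                       images=[img_before, img_after],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    d_img = torch.nn.functional.normalize(img_emb[1] - img_emb[0], dim=-1)
    d_txt = torch.nn.functional.normalize(txt_emb[1] - txt_emb[0], dim=-1)
    return float((d_img * d_txt).sum())
```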

Conclusions

SIGNeRF presents a significant step forward in scene-integrated editing for NeRF scenes, offering a fast, controllable, and customizable approach to 3D generation. It produces more consistent edits in a single run, streamlines the process compared to current editing methods, and gives users an initial preview of the result. Although focused on NeRF, the modularity of SIGNeRF allows adaptation to other 3D scene representations. While acknowledging the potential for misuse in creating convincing forgeries, the authors hope SIGNeRF will further democratize 3D content generation and ultimately benefit the broader field.