Edit-DiffNeRF: Editing 3D Neural Radiance Fields using 2D Diffusion Model (2306.09551v1)

Published 15 Jun 2023 in cs.CV

Abstract: Recent research has demonstrated that the combination of pretrained diffusion models with neural radiance fields (NeRFs) has emerged as a promising approach for text-to-3D generation. However, simply coupling NeRF with diffusion models results in cross-view inconsistency and degradation of stylized view syntheses. To address this challenge, we propose the Edit-DiffNeRF framework, which is composed of a frozen diffusion model, a proposed delta module to edit the latent semantic space of the diffusion model, and a NeRF. Instead of training the entire diffusion model for each scene, our method focuses on editing the latent semantic space in frozen pretrained diffusion models via the delta module. This fundamental change to the standard diffusion framework enables us to make fine-grained modifications to the rendered views and effectively consolidate these instructions in a 3D scene via NeRF training. As a result, we are able to produce an edited 3D scene that faithfully aligns with input text instructions. Furthermore, to ensure semantic consistency across different viewpoints, we propose a novel multi-view semantic consistency loss that extracts a latent semantic embedding from the input view as a prior and aims to reconstruct it in different views. Our proposed method has been shown to effectively edit real-world 3D scenes, resulting in a 25% improvement in the alignment of the performed 3D edits with text instructions compared to prior work.

Authors (3)
  1. Lu Yu (87 papers)
  2. Wei Xiang (106 papers)
  3. Kang Han (7 papers)
Citations (15)

Summary

"Edit-DiffNeRF: Editing 3D Neural Radiance Fields using 2D Diffusion Model" is a cutting-edge research paper that addresses a significant challenge in text-to-3D generation using neural radiance fields (NeRFs) and pretrained diffusion models. The combination of these technologies has shown promise, but conventional methods often suffer from cross-view inconsistencies and a degradation in the stylized synthesis of views.

To mitigate these issues, the authors propose the Edit-DiffNeRF framework, comprising three main components:

  1. Frozen Diffusion Model: Instead of retraining the entire diffusion model for each scene, the authors keep the pretrained model frozen (see the parameter-freezing sketch after this list).
  2. Delta Module: Introduced to edit the latent semantic space of the frozen diffusion model. Editing the semantic space rather than retraining from scratch allows fine-grained modifications aligned with the text instructions.
  3. NeRF: Consolidates the edited views produced by the other two components into a coherent, consistent 3D scene.
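
The parameter split implied by this design can be summarized in a short PyTorch-style sketch. It is illustrative only: diffusion, delta, and nerf are assumed to be nn.Module instances, and the choice of optimizer and learning rate are assumptions rather than details taken from the paper.

```python
import torch

def build_optimizer(diffusion, delta, nerf, lr=1e-4):
    """Freeze the pretrained diffusion weights; train only the delta module
    and the NeRF (module names, Adam, and the learning rate are assumptions)."""
    for p in diffusion.parameters():
        p.requires_grad_(False)  # the pretrained diffusion model is never updated
    trainable = list(delta.parameters()) + list(nerf.parameters())
    return torch.optim.Adam(trainable, lr=lr)
```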

The fundamental innovation lies in the delta module, which allows for the fine-tuning of the latent semantic space. This enables precise modifications to the 2D diffusion model's output, which are then faithfully translated into the 3D domain via NeRF. Notably, this method avoids the need for extensive retraining, making it more efficient.
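
As a concrete illustration of what such a delta module might look like, the sketch below predicts an additive edit to a latent embedding, conditioned on a text-instruction embedding. The class name, MLP architecture, dimensions, and the random tensors standing in for the frozen encoder's output and the text embedding are all assumptions made for illustration; the paper's actual module may differ.

```python
import torch
import torch.nn as nn

class DeltaModule(nn.Module):
    """Hypothetical delta module: predicts an additive edit to the frozen
    diffusion model's latent semantic embedding, conditioned on a text
    instruction embedding."""
    def __init__(self, latent_dim: int, text_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + text_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, latent: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # An additive residual keeps the edit close to the frozen model's latent space.
        return latent + self.mlp(torch.cat([latent, text_emb], dim=-1))

# Toy usage: random tensors stand in for a frozen encoder's view latent
# and a text encoder's instruction embedding.
delta = DeltaModule(latent_dim=512, text_dim=256)
view_latent = torch.randn(1, 512)
instruction_emb = torch.randn(1, 256)
edited_latent = delta(view_latent, instruction_emb)
print(edited_latent.shape)  # torch.Size([1, 512])
```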

Additionally, the authors introduce a multi-view semantic consistency loss that plays a critical role in ensuring that the semantic information is consistently maintained across different viewpoints. This loss function works by extracting a latent semantic embedding from the input view and aiming to reconstruct it accurately in different views, thereby improving the overall coherence and alignment of the 3D scene with the input text.
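
One plausible way to write such a loss (the notation below is illustrative; the paper's exact formulation may differ): let E(·) denote the latent semantic encoder, I_0 the input reference view, and I_1, ..., I_V the rendered novel views. The consistency term then penalizes deviation of each rendered view's embedding from the reference embedding:

```latex
\mathcal{L}_{\text{sem}} \;=\; \frac{1}{V} \sum_{v=1}^{V} \left\lVert\, E(I_v) - E(I_0) \,\right\rVert_2^{2}
```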

Empirical results demonstrate the efficacy of Edit-DiffNeRF, with the method achieving a 25% improvement in aligning 3D edits with text instructions compared to previous approaches. This significant enhancement underlines the framework's capability to edit real-world 3D scenes effectively, maintaining both visual and semantic consistency across multiple views.

In summary, Edit-DiffNeRF presents a novel approach to overcoming the challenges in text-to-3D generation by editing latent semantic spaces of frozen diffusion models, ensuring fine-tuned, coherent 3D scene synthesis in alignment with user-provided text instructions.