CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout (2303.13843v5)
Abstract: Text-to-3D generation plays a crucial role in creating editable 3D scenes for AR/VR. Recent advances have shown promise in merging neural radiance fields (NeRFs) with pre-trained diffusion models for text-to-3D object generation. However, one enduring challenge is their inadequate capability to accurately parse and regenerate consistent multi-object environments. Specifically, these models have difficulty representing the quantity and style specified by multi-object prompts, often collapsing in rendering fidelity and failing to match the semantic intricacies of the text. Moreover, amalgamating these elements into a coherent 3D scene is a substantial challenge, stemming from the generic distributions inherent in diffusion models. To tackle this 'guidance collapse' and further enhance scene consistency, we propose a novel framework, dubbed CompoNeRF, which integrates an editable 3D scene layout with object-specific and scene-wide guidance mechanisms. It first interprets a complex text prompt into a layout populated with multiple NeRFs, each paired with a corresponding subtext prompt for precise object depiction. Next, a tailored composition module seamlessly blends these NeRFs to promote consistency, while dual-level text guidance reduces ambiguity and boosts accuracy. Notably, our composition design also permits decomposition, enabling flexible scene editing and recomposition into new scenes from an edited layout or text prompt. Using the open-source Stable Diffusion model, CompoNeRF generates multi-object scenes with high fidelity. Remarkably, our framework achieves up to a 54% improvement on the multi-view CLIP score metric. Our user study indicates that our method significantly improves semantic accuracy, multi-view consistency, and individual recognizability in multi-object scene generation.
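The composition of multiple object NeRFs into a single scene can be sketched as follows. This is a minimal illustration, not the paper's actual composition module: it assumes density-weighted color blending at shared ray samples (a common convention in compositional radiance fields), followed by standard volume rendering. All function and variable names here are illustrative.

```python
import numpy as np

def composite_fields(sigmas, colors, eps=1e-8):
    """Blend K object fields queried at the same N ray samples.

    sigmas: (K, N) per-object volume densities
    colors: (K, N, 3) per-object RGB values
    Returns the composite density (N,) and density-weighted color (N, 3).
    """
    sigma = sigmas.sum(axis=0)                      # densities add
    w = sigmas / (sigma[None, :] + eps)             # per-object blend weights
    color = (w[..., None] * colors).sum(axis=0)     # weighted RGB mix
    return sigma, color

def volume_render(sigma, color, deltas):
    """Standard NeRF quadrature along one ray.

    sigma: (N,) composite density, color: (N, 3), deltas: (N,) sample spacings.
    """
    alpha = 1.0 - np.exp(-sigma * deltas)           # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha                         # transmittance * opacity
    return (weights[:, None] * color).sum(axis=0)   # rendered RGB

# Example: an object with zero density contributes nothing to the blend.
sigmas = np.array([[2.0, 2.0], [0.0, 0.0]])        # object 2 is empty here
colors = np.array([[[1.0, 0.0, 0.0]] * 2,          # object 1: red
                   [[0.0, 1.0, 0.0]] * 2])         # object 2: green
sigma, color = composite_fields(sigmas, colors)
rgb = volume_render(sigma, color, deltas=np.array([0.5, 0.5]))
```

Under this convention, editing the layout reduces to translating or removing one object's samples before compositing, which is what makes per-object decomposition and recomposition straightforward.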
- Haotian Bai
- Yuanhuiyi Lyu
- Lutao Jiang
- Sijia Li
- Haonan Lu
- Xiaodong Lin
- Lin Wang