Decomposing NeRF for Editing via Feature Field Distillation (2205.15585v2)

Published 31 May 2022 in cs.CV and cs.GR

Abstract: Emerging neural radiance fields (NeRF) are a promising scene representation for computer graphics, enabling high-quality 3D reconstruction and novel view synthesis from image observations. However, editing a scene represented by a NeRF is challenging, as the underlying connectionist representations such as MLPs or voxel grids are not object-centric or compositional. In particular, it has been difficult to selectively edit specific regions or objects. In this work, we tackle the problem of semantic scene decomposition of NeRFs to enable query-based local editing of the represented 3D scenes. We propose to distill the knowledge of off-the-shelf, self-supervised 2D image feature extractors such as CLIP-LSeg or DINO into a 3D feature field optimized in parallel to the radiance field. Given a user-specified query of various modalities such as text, an image patch, or a point-and-click selection, 3D feature fields semantically decompose 3D space without the need for re-training and enable us to semantically select and edit regions in the radiance field. Our experiments validate that the distilled feature fields (DFFs) can transfer recent progress in 2D vision and language foundation models to 3D scene representations, enabling convincing 3D segmentation and selective editing of emerging neural graphics representations.

Authors (3)
  1. Sosuke Kobayashi (19 papers)
  2. Eiichi Matsumoto (7 papers)
  3. Vincent Sitzmann (38 papers)
Citations (288)

Summary

Decomposing NeRF for Editing via Feature Field Distillation

The paper "Decomposing NeRF for Editing via Feature Field Distillation" explores advancements in the editing of neural radiance fields (NeRFs), ideally suited for the complex task of high-quality 3D reconstruction and novel view synthesis from image data. However, these tasks are constrained owing to the non-object-centric nature of the connectionist models, such as MLPs or voxel grids, employed within NeRFs. The primary focus here lies in addressing the challenge of performing selective edits on NeRF-represented scenes by employing semantic decomposition. The authors propose a novel method leveraging feature field distillation that enables precise 3D scene editing through user-specific queries.

Methodological Insights

The paper introduces Distilled Feature Fields (DFFs) as a means to facilitate NeRF scene decomposition. DFFs utilize contemporary feature extractors like CLIP-LSeg and DINO to distill knowledge from 2D images into a 3D feature field optimized in parallel to the radiance field. This is achieved by training a scene-specific DFF via a distillation framework, capitalizing on pre-trained models in the image domain made feasible by large, high-quality datasets and extensive self-supervised learning.
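
To make the distillation objective concrete, the sketch below shows one way such a feature branch could be trained: per-point features are composited along each ray with the radiance field's volume-rendering weights and regressed onto the teacher's 2D features at the corresponding pixels. This is a minimal illustration, not the authors' implementation; the module structure, feature dimensionality, and names such as render_weights and teacher_features are assumptions for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureField(nn.Module):
    """Scene-specific feature branch optimized alongside the radiance field.

    Maps a 3D point to a vector in the teacher's embedding space
    (e.g. CLIP-LSeg or DINO features); layer sizes here are illustrative.
    """
    def __init__(self, feature_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feature_dim),
        )

    def forward(self, xyz):            # xyz: (n_rays, n_samples, 3)
        return self.mlp(xyz)           # -> (n_rays, n_samples, feature_dim)


def distillation_loss(feature_field, xyz, render_weights, teacher_features):
    """Composite per-point features along each ray and regress onto 2D teacher features.

    render_weights:   (n_rays, n_samples) compositing weights from the radiance
                      field, detached so the feature branch cannot alter geometry.
    teacher_features: (n_rays, feature_dim) 2D features at the rays' pixels.
    """
    point_feats = feature_field(xyz)                   # (R, S, D)
    weights = render_weights.detach().unsqueeze(-1)    # (R, S, 1)
    rendered = (weights * point_feats).sum(dim=1)      # (R, D)
    return F.mse_loss(rendered, teacher_features)
```

Detaching the compositing weights keeps the feature branch from perturbing the reconstructed geometry, in the spirit of optimizing the feature field in parallel to, rather than jointly with, the radiance field.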

By allowing a 3D feature field to encode semantic information across three-dimensional coordinates, the method enables editing operations within NeRFs without retraining, a significant advantage given the computational intensity characteristic of NeRF training. Query-based editing is facilitated by directly mapping a range of user queries, whether text, an image patch, or a point-and-click selection, to meaningful edits of the 3D scene representation.
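
Once a feature field is distilled, query-based selection can be viewed as a similarity test between the query's embedding and per-point features, followed by a local edit of the radiance field in the selected region. The sketch below illustrates this with a cosine-similarity threshold and region deletion by zeroing density; the threshold value, function names, and the assumption that the query is embedded by the same 2D model that supplied the distilled features are illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def select_by_query(point_features, query_embedding, threshold=0.5):
    """Select 3D sample points whose distilled features match a query.

    point_features:  (n_points, feature_dim) outputs of the distilled feature field.
    query_embedding: (feature_dim,) embedding of a text, image-patch, or click query,
                     assumed to come from the same 2D model whose features were distilled.
    Returns a boolean mask over the n_points samples.
    """
    sims = F.cosine_similarity(point_features, query_embedding.unsqueeze(0), dim=-1)
    return sims > threshold


def delete_selected_region(density, mask):
    """One possible local edit: remove the selected region by zeroing its density."""
    edited = density.clone()
    edited[mask] = 0.0
    return edited
```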

Experimental Results

The experiments validate the efficacy of the proposed DFF approach in transferring the semantic understanding embedded in strong 2D vision and language models to 3D representations. They demonstrate high segmentation performance against benchmarks employing labels from ScanNet. Notably, the experiments revealed significant performance gains in 3D segmentation when using DFFs over conventional, supervised 3D CNN techniques.

Furthermore, the authors illustrate a variety of 3D appearance and geometry editing operations, highlighting the versatility and practical usability of this methodology in real-world applications. The reported results effectively underscore the capability of DFFs to perform selective, multi-modal semantic edits within NeRF-represented 3D environments.

Implications and Future Directions

The proposed method's most immediate implication is enhancing the editability of complex NeRF scenes. By decoupling semantic selection from the radiance field's geometry and appearance, DFFs support a greater range of coherent, localized scene modifications, thereby opening new possibilities in areas such as digital content creation, virtual reality, and augmented reality.

From a theoretical standpoint, this work expands the boundaries of neural scene representations, particularly in extending the utility of NeRFs from static rendering to dynamic, interactive scene manipulation. It suggests that future explorations might refine this semantic distillation further, improving the granularity of scene decomposition without compromising the quality or fidelity of novel viewpoints rendered after editing.

Moreover, as the paper hints, the ability to integrate new self-supervised and zero-shot learning models may enhance the versatility and broaden the application domain of DFFs, further streamlining 3D content editing workflows. This work also invites an examination of volume rendering priors and regularization techniques to refine the semantic mapping between 2D image features and their 3D scene instantiation for even finer control over scene edits.

In summary, this paper offers significant contributions to the domain of neural scene representations by advancing the editability and semantic control within NeRFs through the distillation of powerful 2D feature extractors, a step forward in personalizing and enhancing 3D reconstructions for diverse practical applications.