
Local 3D Editing via 3D Distillation of CLIP Knowledge (2306.12570v1)

Published 21 Jun 2023 in cs.CV

Abstract: 3D content manipulation is an important computer vision task with many real-world applications (e.g., product design, cartoon generation, and 3D Avatar editing). Recently proposed 3D GANs can generate diverse photorealistic 3D-aware contents using Neural Radiance fields (NeRF). However, manipulation of NeRF still remains a challenging problem since the visual quality tends to degrade after manipulation and suboptimal control handles such as 2D semantic maps are used for manipulations. While text-guided manipulations have shown potential in 3D editing, such approaches often lack locality. To overcome these problems, we propose Local Editing NeRF (LENeRF), which only requires text inputs for fine-grained and localized manipulation. Specifically, we present three add-on modules of LENeRF, the Latent Residual Mapper, the Attention Field Network, and the Deformation Network, which are jointly used for local manipulations of 3D features by estimating a 3D attention field. The 3D attention field is learned in an unsupervised way, by distilling the zero-shot mask generation capability of CLIP to the 3D space with multi-view guidance. We conduct diverse experiments and thorough evaluations both quantitatively and qualitatively.

Citations (22)

Summary

  • The paper introduces LENeRF, enabling fine-grained local edits in 3D models by distilling CLIP’s zero-shot mask generation.
  • The paper combines a Latent Residual Mapper, an Attention Field Network, and a Deformation Network to achieve precise, localized modifications.
  • Experimental evaluations show that LENeRF delivers high-quality visual results and robust performance in varied 3D editing scenarios.

"Local 3D Editing via 3D Distillation of CLIP Knowledge" addresses two key limitations of Neural Radiance Fields (NeRF) based 3D content manipulation: preserving visual quality after editing and achieving fine-grained, localized edits without relying on suboptimal control handles such as 2D semantic maps. The authors introduce Local Editing NeRF (LENeRF), a framework that uses only text inputs to drive nuanced local modifications in 3D space.

Key to LENeRF are three add-on modules:

  1. Latent Residual Mapper: This module fine-tunes latent representations to facilitate subtle, localized adjustments in the generated 3D content.
  2. Attention Field Network: This component calculates a 3D attention field, enabling the system to focus on specific regions within the 3D space during edits.
  3. Deformation Network: This network allows for geometric changes within the 3D model, effectively deforming the content based on the attention field's guidance.
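
The core idea behind keeping edits local is to interpolate between the original and edited 3D features using the estimated attention field. A minimal sketch of that position-wise blending, using random numpy arrays in place of real radiance-field features and a real attention-field network (all shapes and names here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: N sampled 3D points, each with a C-dim feature.
N, C = 1024, 32
f_src = rng.normal(size=(N, C))   # features of the original radiance field
f_edit = rng.normal(size=(N, C))  # features after the text-driven edit

# Per-point soft attention in [0, 1], as an attention-field network
# might estimate it (random values stand in for the network here).
attn = rng.uniform(size=(N, 1))

# Position-wise interpolation: edited features dominate only where the
# attention field is high, so the manipulation stays local.
f_fused = attn * f_edit + (1.0 - attn) * f_src
```

Because the blend is a convex combination at every point, regions where the attention field is near zero are guaranteed to retain the source content unchanged.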

A notable contribution of this work is the unsupervised learning of the 3D attention field. The authors achieve this by distilling CLIP's zero-shot mask generation capability into the 3D domain, guided by multiple camera views. The system thereby inherits CLIP's ability to produce masks from textual descriptions, now grounded in a consistent 3D representation.
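
One way to picture this distillation: the 3D attention field is volume-rendered into 2D masks from several viewpoints and supervised against CLIP-derived pseudo masks for the edit text. The sketch below illustrates only the supervision signal, with random arrays standing in for both the rendered attention maps and the CLIP-generated masks (the shapes, the binarization, and the plain binary cross-entropy loss are assumptions for illustration, not the paper's exact objective):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: V camera views, each rendering an H x W attention map.
V, H, W = 4, 16, 16

# Soft masks volume-rendered from the 3D attention field (stand-in values).
rendered = rng.uniform(0.01, 0.99, size=(V, H, W))

# Pseudo ground-truth masks, e.g. derived from CLIP's zero-shot relevance
# to the edit text, binarized here for illustration.
pseudo = (rng.uniform(size=(V, H, W)) > 0.5).astype(np.float64)

def bce(pred, target):
    """Per-pixel binary cross-entropy, averaged over all views."""
    return float(-np.mean(target * np.log(pred)
                          + (1 - target) * np.log(1 - pred)))

loss = bce(rendered, pseudo)
```

Averaging the loss over multiple views is what pushes the attention field toward a single 3D-consistent mask rather than view-dependent 2D ones.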

Through their experimental evaluations, both quantitative and qualitative, the researchers demonstrate that LENeRF can perform fine-grained and high-quality local edits in 3D models. This advancement holds potential for real-world applications in fields like product design, animation, and avatar customization, enhancing the flexibility and precision of 3D content manipulation using text inputs.