Sem2NeRF: Converting Single-View Semantic Masks to Neural Radiance Fields (2203.10821v2)

Published 21 Mar 2022 in cs.CV, cs.AI, and cs.GR

Abstract: Image translation and manipulation have gained increasing attention along with the rapid development of deep generative models. Although existing approaches have brought impressive results, they mainly operate in 2D space. In light of recent advances in NeRF-based 3D-aware generative models, we introduce a new task, Semantic-to-NeRF translation, which aims to reconstruct a 3D scene modelled by NeRF, conditioned on a single-view semantic mask as input. To kick off this novel task, we propose the Sem2NeRF framework. In particular, Sem2NeRF addresses this highly challenging task by encoding the semantic mask into a latent code that controls the 3D scene representation of a pre-trained decoder. To further improve the accuracy of the mapping, we integrate a new region-aware learning strategy into the design of both the encoder and the decoder. We verify the efficacy of the proposed Sem2NeRF and demonstrate that it outperforms several strong baselines on two benchmark datasets. Code and video are available at https://donydchen.github.io/sem2nerf/

Citations (34)

Summary

  • The paper proposes Sem2NeRF, a framework that introduces the task of translating single-view 2D semantic masks into 3D neural radiance fields (NeRFs).
  • Sem2NeRF employs a region-aware learning strategy within its architecture to effectively translate 2D semantic masks into accurate 3D scene representations.
  • The model demonstrates superior performance over baseline methods on datasets like CelebAMask-HQ and CatMask, achieving higher cross-view consistency and image quality.

Overview of Sem2NeRF: Semantic Mask to Neural Radiance Field Translation

The paper "Sem2NeRF: Converting Single-View Semantic Masks to Neural Radiance Fields" proposes a framework that translates single-view semantic masks into 3D neural radiance fields (NeRFs). The authors introduce a new task, Semantic-to-NeRF translation, which aims to generate comprehensive 3D scene representations from 2D semantic inputs. This task offers substantial challenges due to the inherent information gap between dense 3D structures and sparse 2D semantics, compounded by varied semantic region distribution.

Core Contributions

The primary contribution of this paper is the Sem2NeRF framework for Semantic-to-NeRF translation. The framework encodes the input semantic mask into a latent code that conditions the 3D scene representation of a pre-trained decoder. The authors additionally propose a region-aware learning strategy, integrated into both the encoder and the decoder, to improve the fidelity of the semantic mapping. This design allows Sem2NeRF to outperform several strong baselines on two benchmark datasets, CelebAMask-HQ and CatMask.
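
To make the mechanism concrete, below is a minimal sketch of this encode-then-decode mapping in PyTorch. The module names, tensor shapes, and camera-pose format are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Sem2NeRFSketch(nn.Module):
    """Illustrative mapping from a semantic mask to a NeRF latent code.
    Module names, shapes, and the pose format are assumptions, not the
    authors' exact architecture."""

    def __init__(self, encoder: nn.Module, nerf_generator: nn.Module):
        super().__init__()
        self.encoder = encoder                 # e.g. a Swin-style mask encoder
        self.generator = nerf_generator        # pre-trained 3D-aware generator
        for p in self.generator.parameters():  # the decoder stays frozen;
            p.requires_grad = False            # only the mapping is learned

    def forward(self, mask: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        # mask: (B, C, H, W) one-hot semantic mask; pose: camera parameters
        latent = self.encoder(mask)            # (B, latent_dim) latent code
        return self.generator(latent, pose)    # volume-rendered (B, 3, H, W) image
```

Because the generator is pre-trained and frozen, training reduces to learning the mask-to-latent mapping, which is what makes single-view supervision feasible.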

Key Innovations

  1. Semantic-to-NeRF Task: The conceptualization of translating semantic masks to NeRFs builds on recent advances in 3D-aware generative models to bridge the 2D-3D semantic gap.
  2. Region-Aware Learning Strategy: This strategy mitigates information sparsity and semantic class imbalance by introducing regional sensitivity during learning, sharpening the model's attention to small but significant semantic regions (see the sketch following this list).
  3. Encoder and Decoder Integration: A Swin Transformer encoder for the 2D semantic mask, paired with a π-GAN-based NeRF generator, enables efficient mapping from semantic space to high-dimensional 3D space, supporting nuanced image manipulation and content editing.
  4. Adaptability and Extension: The proposed framework is tested on typical subjects such as human faces and is extended to other domains such as cats using generative pseudo-labeling, indicating broad applicability.
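
One plausible instantiation of region awareness is a reconstruction loss averaged per semantic region, so that small classes (e.g., eyes or lips) are not drowned out by large ones. The sketch below is a hedged illustration under that assumption; the paper's actual strategy also shapes how the encoder and decoder attend to regions, which this snippet does not capture.

```python
import torch

def region_weighted_l1(pred: torch.Tensor, target: torch.Tensor,
                       mask: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Hedged illustration of a region-aware loss: each semantic region
    contributes its mean error, regardless of its pixel area.
    pred, target: (B, 3, H, W) images; mask: (B, H, W) integer labels.
    """
    per_pixel = (pred - target).abs().mean(dim=1)        # (B, H, W) L1 error
    loss = pred.new_zeros(())
    for c in range(num_classes):
        region = (mask == c).float()                     # (B, H, W) indicator
        area = region.sum().clamp(min=1.0)               # avoid divide-by-zero
        loss = loss + (per_pixel * region).sum() / area  # mean error in region
    return loss / num_classes
```

Weighting by region rather than by pixel is one standard way to counter class imbalance in dense prediction, which is the failure mode the paper's strategy targets.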

Empirical Evaluations

Empirical validation shows that Sem2NeRF outperforms several existing techniques, including pSp, pix2pixHD combined with GAN inversion, and SofGAN, in cross-view image consistency and image quality, as reflected in stronger FID and IS scores.
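
For reference, FID can be computed with off-the-shelf tooling such as torchmetrics; the snippet below uses random uint8 tensors purely as stand-ins for the real and generated image batches an actual evaluation would load.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Stand-in data: in practice these would be dataset images and model outputs,
# as uint8 tensors of shape (N, 3, H, W), which torchmetrics expects by default.
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # Inception pool features
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {fid.compute():.2f}")  # lower is better
```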

Implications and Future Directions

The Sem2NeRF model sets a precedent for 3D scene synthesis conditioned on 2D semantic inputs, opening possibilities for applications such as AR/VR environments and interactive design platforms. Future work might address more complex scenes or improve the precision of semantic editing. Increasing the efficiency and adaptability of the learning strategy could also enable real-time systems and more diverse datasets.

Overall, the introduction of Semantic-to-NeRF translation and the Sem2NeRF framework represent pivotal progress toward comprehensive and flexible 3D scene understanding and generation, suggesting promising implications for ongoing and future AI developments in image processing and synthesis.
