
Virtual Multi-view Fusion for 3D Semantic Segmentation (2007.13138v1)

Published 26 Jul 2020 in cs.CV and eess.IV

Abstract: Semantic segmentation of 3D meshes is an important problem for 3D scene understanding. In this paper we revisit the classic multiview representation of 3D meshes and study several techniques that make them effective for 3D semantic segmentation of meshes. Given a 3D mesh reconstructed from RGBD sensors, our method effectively chooses different virtual views of the 3D mesh and renders multiple 2D channels for training an effective 2D semantic segmentation model. Features from multiple per view predictions are finally fused on 3D mesh vertices to predict mesh semantic segmentation labels. Using the large scale indoor 3D semantic segmentation benchmark of ScanNet, we show that our virtual views enable more effective training of 2D semantic segmentation networks than previous multiview approaches. When the 2D per pixel predictions are aggregated on 3D surfaces, our virtual multiview fusion method is able to achieve significantly better 3D semantic segmentation results compared to all prior multiview approaches and competitive with recent 3D convolution approaches.

Authors (7)
  1. Abhijit Kundu (16 papers)
  2. Xiaoqi Yin (8 papers)
  3. Alireza Fathi (31 papers)
  4. David Ross (12 papers)
  5. Brian Brewington (2 papers)
  6. Thomas Funkhouser (66 papers)
  7. Caroline Pantofaru (15 papers)
Citations (152)

Summary

Virtual Multi-view Fusion for 3D Semantic Segmentation

Semantic segmentation of 3D scenes is a pivotal task in computer vision, serving numerous applications such as semantic mapping, autonomous navigation, and site monitoring. Traditional methods for 3D semantic segmentation typically rely on 3D convolutional networks operating on sparse voxel grids, or alternatively, fuse 2D image segmentation results onto 3D surfaces. The paper "Virtual Multi-view Fusion for 3D Semantic Segmentation" introduces a novel approach that addresses the limitations of prior methods by utilizing virtual views synthesized from 3D models of scenes.

Methodological Insights

The proposed approach centers on generating virtual views of 3D scenes under conditions chosen to favor 2D semantic segmentation models. These views are synthesized with freely chosen camera intrinsics and extrinsics, including viewpoints that would be impractical for a physical camera, so that each rendered image captures more context than typical real-world capture allows. Key innovations include:

  1. Wide Field-of-View (FOV): Virtual views are rendered with a wide FOV, capturing extensive contextual information, which enhances the prediction capability of 2D semantic segmentation models.
  2. Virtual Viewpoint Selection: The method strategically selects virtual camera positions that minimize occlusion and optimize view coverage, including positions that are physically impossible, such as views from behind walls.
  3. Rendering with Additional Channels: Virtual images include additional channels like surface normals and coordinates, providing richer information to semantic segmentation networks.
  4. Accurate Pixel-wise Fusion: The semantic labels from 2D predictions are projected onto 3D surfaces using precise camera parameters, reducing errors such as label bleeding across occlusion boundaries (a minimal fusion sketch follows this list).
  5. Leveraging Pre-training: The framework adapts 2D models pretrained on large-scale image datasets, such as ImageNet and COCO, thereby enhancing the segmentation performance by transferring learned features to the task of 3D segmentation.
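To make the pipeline concrete, the sketch below illustrates the fusion step under simple assumptions; it is not the authors' implementation. Each virtual view is assumed to provide a rendered depth map, camera intrinsics and extrinsics, and per-pixel class scores from a 2D segmentation model; mesh vertices are projected into each view, checked for visibility against the rendered depth, and their class scores accumulated. The function name `fuse_views` and the dictionary layout are illustrative.

```python
# Minimal sketch of projecting per-view 2D predictions onto mesh vertices and
# fusing them, under assumed inputs; not the authors' implementation.
import numpy as np

def fuse_views(vertices, views, num_classes, depth_tol=0.01):
    """Accumulate per-pixel class scores onto 3D mesh vertices.

    vertices : (V, 3) mesh vertex positions in world coordinates
    views    : iterable of dicts, each with
               'K'      (3, 3) pinhole intrinsics of the virtual camera
               'T_wc'   (4, 4) world-to-camera extrinsics
               'depth'  (H, W) depth map rendered from the mesh
               'scores' (H, W, num_classes) 2D segmentation scores
    Returns a (V, num_classes) array; argmax over classes gives vertex labels.
    """
    fused = np.zeros((len(vertices), num_classes))
    homog = np.hstack([vertices, np.ones((len(vertices), 1))])

    for view in views:
        cam = (view['T_wc'] @ homog.T).T[:, :3]      # world -> camera frame
        z = cam[:, 2]
        pix = (view['K'] @ cam.T).T                  # perspective projection
        u = pix[:, 0] / np.where(z > 0, z, 1.0)
        v = pix[:, 1] / np.where(z > 0, z, 1.0)

        H, W = view['depth'].shape
        inside = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        ui = np.clip(u.astype(int), 0, W - 1)
        vi = np.clip(v.astype(int), 0, H - 1)

        # Occlusion check: keep only vertices whose camera-space depth agrees
        # with the rendered depth map at the projected pixel.
        visible = inside & (np.abs(z - view['depth'][vi, ui]) < depth_tol)
        fused[visible] += view['scores'][vi[visible], ui[visible]]

    return fused

# Example usage (hypothetical inputs):
# labels = fuse_views(mesh_vertices, rendered_views, num_classes=20).argmax(1)
```

The paper's full pipeline also involves view selection heuristics and the additional rendered channels described above; the sketch only shows the basic unproject-check-accumulate mechanism.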

Empirical Evaluation

Evaluated on the ScanNet benchmark for indoor 3D scene segmentation, the proposed method shows significant improvements over existing multiview approaches and is competitive with methods based on pure 3D convolutions. Notably, it achieves a 3D mean Intersection over Union (mIoU) of 74.6% on the ScanNet test set, outperforming prior view-centric methods, which typically do not rank highly on this benchmark.
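For reference, mIoU here is the standard per-class intersection over union averaged across semantic classes, computed over labeled mesh vertices. A minimal illustration of the metric (not ScanNet's official evaluator) is:

```python
# Illustrative mIoU over per-vertex labels; not ScanNet's official evaluator.
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=-1):
    valid = gt != ignore_label          # mask out unannotated vertices
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union > 0:                   # skip classes absent from both
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```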

  1. Performance with Varying Training Sizes: The paper highlights that the multiview fusion method remains robust even when trained with fewer scenes, demonstrating its capacity for effective data augmentation through diverse virtual views.
  2. Inference Efficiency: The number of views required at inference can be reduced substantially without significant performance degradation, so the method reaches comparable segmentation accuracy with far fewer images than traditional multiview systems require.

Implications and Future Directions

The results suggest that enhancing the multiview approach with strategic virtual view selection and rendering can close the performance gap with 3D convolutional networks while offering a less memory-intensive alternative. Although rooted in classic multiview strategies, the approach revives interest in view-based methods by showing they can be competitive with the state of the art when 2D learning is better exploited.

Future investigations could explore further refinement of virtual camera parameters, more sophisticated feature fusion techniques on 3D surfaces, and the application of this approach to outdoor or dynamic scenes where 3D reconstruction may be incomplete or inconsistent. Extending the method to such settings would broaden its applicability to environments with distinct challenges, such as temporal change and outdoor occlusion.

In conclusion, this paper proposes a compelling alternative to pure 3D convolution approaches, leveraging virtual view synthesis to enhance 3D semantic segmentation accuracy and efficiency. Its contribution lies not only in performance improvements but also in advocating for revisiting and modernizing older methodologies with current computational capabilities and learning paradigms.
