PanopticFusion: Advancements in Semantic Mapping at the Level of Stuff and Things
The paper "PanopticFusion: Online Volumetric Semantic Mapping at the Level of Stuff and Things" introduces an innovative semantic mapping system, which involves understanding a 3D scene by integrating panoptic segmentation capabilities into the mapping process. This system is designed to identify and label regions considered as "stuff" (such as floors and walls) and "things" (discrete objects like chairs and tables), thereby merging large-scale 3D reconstruction with enhanced semantic comprehension.
Overview of the Proposed System
PanopticFusion leverages both 2D semantic segmentation and instance segmentation, using networks such as PSPNet and Mask R-CNN respectively, and fuses their outputs into pixel-wise panoptic labels for each incoming RGB frame. These labels are integrated into a volumetric map alongside depth measurements, while instance IDs are kept consistent across frames. PanopticFusion uses a spatially hashed volumetric map representation, which supports large-scale scene reconstruction and labeled mesh extraction. A fully connected conditional random field (CRF) model regularizes the map, with a novel unary potential approximation and a map-division strategy enabling efficient online CRF inference.
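To illustrate the per-frame fusion step, the minimal sketch below combines a "stuff" semantic map with "thing" instance masks into a single panoptic labeling. The function name, label constants, and the confidence-ordered overlap resolution are illustrative assumptions; the paper's exact merging rules and its cross-frame instance ID tracking are not reproduced here.

```python
import numpy as np

# Illustrative label constants; the real system uses a dataset-specific label set.
STUFF_CLASSES = {1, 2}   # e.g. wall, floor
UNLABELED = 0

def fuse_panoptic(semantic_map, instance_masks, instance_classes, instance_scores,
                  score_thresh=0.5):
    """Fuse a pixel-wise semantic map ("stuff") with detected instance masks
    ("things") into a single panoptic labeling of one RGB frame.

    semantic_map:     (H, W) int array of class IDs, e.g. from a PSPNet-style network.
    instance_masks:   list of (H, W) bool arrays, e.g. from a Mask R-CNN-style network.
    instance_classes: predicted class ID for each mask.
    instance_scores:  detection confidence for each mask.

    Returns (class_map, instance_map); instance_map is 0 for "stuff" pixels.
    """
    # Keep only "stuff" predictions from the semantic network.
    class_map = np.where(np.isin(semantic_map, list(STUFF_CLASSES)),
                         semantic_map, UNLABELED)
    instance_map = np.zeros_like(semantic_map, dtype=np.int32)

    # Paint lower-confidence instances first so higher-confidence ones win overlaps.
    order = np.argsort(instance_scores)
    next_id = 1
    for idx in order:
        if instance_scores[idx] < score_thresh:
            continue
        mask = instance_masks[idx]
        class_map = np.where(mask, instance_classes[idx], class_map)
        instance_map = np.where(mask, next_id, instance_map)
        next_id += 1
    return class_map, instance_map
```

In the actual system, per-frame instance IDs are additionally matched against instances already present in the map so that the same object keeps the same ID across frames; that bookkeeping is omitted above.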
Performance and Benchmarking
The system was evaluated on the ScanNet v2 dataset, a large-scale benchmark for indoor scene understanding. The results show that PanopticFusion outperforms or rivals state-of-the-art offline 3D deep neural network (DNN) methods on both the semantic and the instance segmentation benchmarks. PanopticFusion was particularly strong on smaller objects and on classes requiring contextual semantic distinction. Online, dense, semantically labeled reconstruction of this kind is valuable in scenarios such as real-time robotic path planning and contextual augmented reality (AR), where interaction with objects based on their semantics is crucial.
Technical Contributions
- Integration of Panoptic Segmentation: PanopticFusion is the first semantic mapping system reported to effectively integrate panoptic segmentation, allowing for unified modeling of amorphous regions ("stuff") and countable objects ("things").
- Volumetric Mapping Approach: The system employs a spatially hashed truncated signed distance field (TSDF) map representation, which supports large-scale 3D scene reconstruction and labeled mesh extraction (a minimal data-structure sketch follows this list).
- CRF-Based Map Regularization: By approximating unary potentials and dividing the map into subregions, PanopticFusion improves semantic accuracy while maintaining computational efficiency (see the unary-potential sketch after this list).
- Robust Recognition Results: Demonstrated superior or competitive results in 3D semantic and instance segmentation tasks compared to advanced offline 3D DNN methods.
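To make the spatially hashed TSDF representation concrete, here is a minimal Python sketch: voxel blocks are allocated lazily in a hash table keyed by block coordinates, and each voxel stores a truncated signed distance, an integration weight, and a single panoptic label with an accumulated confidence. The constants, field names, and the counting-style label fusion rule are assumptions for illustration; the paper's exact per-voxel storage and fusion rules may differ.

```python
import numpy as np

VOXEL_SIZE = 0.024           # metres; illustrative, not necessarily the paper's setting
BLOCK_SIDE = 8               # voxels per block edge
TRUNCATION = 4 * VOXEL_SIZE  # TSDF truncation distance

class VoxelBlock:
    """A small dense grid of voxels; blocks are allocated only when observed."""
    def __init__(self):
        shape = (BLOCK_SIDE,) * 3
        self.tsdf = np.ones(shape, dtype=np.float32)
        self.weight = np.zeros(shape, dtype=np.float32)
        # One label per voxel plus its accumulated confidence, a simplification
        # of keeping a full label distribution.
        self.label = np.zeros(shape, dtype=np.int32)
        self.label_weight = np.zeros(shape, dtype=np.float32)

class HashedTSDFMap:
    """Spatially hashed volumetric map: only observed blocks consume memory."""
    def __init__(self):
        self.blocks = {}  # (bx, by, bz) -> VoxelBlock

    def _locate(self, point):
        """Map a 3D point (metres, shape (3,)) to a block key and local voxel index."""
        voxel = np.floor(point / VOXEL_SIZE).astype(int)
        block_key = tuple(voxel // BLOCK_SIDE)
        local = tuple(voxel % BLOCK_SIDE)
        return block_key, local

    def integrate_point(self, point, sdf, label, label_conf, obs_weight=1.0):
        """Fuse one surface observation (signed distance + panoptic label)."""
        block_key, local = self._locate(point)
        block = self.blocks.setdefault(block_key, VoxelBlock())

        # Weighted running average of the truncated signed distance.
        d = np.clip(sdf, -TRUNCATION, TRUNCATION)
        w_old = block.weight[local]
        block.tsdf[local] = (block.tsdf[local] * w_old + d * obs_weight) / (w_old + obs_weight)
        block.weight[local] = w_old + obs_weight

        # Simple label fusion: reinforce a matching label, otherwise decay its
        # confidence and eventually switch (a counting heuristic for illustration).
        if block.label[local] == label:
            block.label_weight[local] += label_conf
        else:
            block.label_weight[local] -= label_conf
            if block.label_weight[local] < 0:
                block.label[local] = label
                block.label_weight[local] = -block.label_weight[local]
```

The hash-keyed blocks are what make the representation scale: memory grows with the observed surface rather than with a fixed bounding volume, and a labeled mesh can be extracted from the allocated blocks.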
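The CRF regularization can likewise be illustrated. A fully connected CRF needs per-voxel unary potentials, but an online map that keeps only the single most likely label and its weight per voxel cannot supply a full label distribution, so that distribution has to be approximated. The sketch below shows one plausible approximation; the mapping from accumulated weight to probability and the uniform spread over the remaining classes are assumptions, not the paper's exact formulation.

```python
import numpy as np

def approximate_unaries(labels, label_weights, num_classes, confidence_floor=0.2):
    """Build per-voxel unary potentials (negative log-probabilities) for a dense
    CRF from a map that stores only the winning label and its weight per voxel.

    labels:        (N,) int array, winning class ID per voxel.
    label_weights: (N,) float array, accumulated confidence of that label.
    Returns an (N, num_classes) array of unary potentials.
    """
    n = labels.shape[0]
    # Map accumulated weight to a probability for the winning label,
    # clamped away from 0 and 1 so the potentials stay finite.
    p_win = np.clip(1.0 - np.exp(-label_weights), confidence_floor, 0.99)
    # Spread the remaining probability mass uniformly over the other classes
    # (an illustrative assumption).
    p_rest = (1.0 - p_win) / (num_classes - 1)

    probs = np.tile(p_rest[:, None], (1, num_classes))
    probs[np.arange(n), labels] = p_win
    return -np.log(probs)
```

The resulting potentials could then be handed to an off-the-shelf dense CRF solver, run over bounded subregions of the map to mirror the map-division idea and keep online inference latency predictable.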
Implications and Future Work
PanopticFusion has broad implications for autonomous robotics and AR applications, where understanding environmental semantics is essential for navigation and interaction tasks. Because the system processes real-world scenes online, the semantic map can be kept up to date as the environment is explored, which suits applications with changing semantic elements and interaction contexts.
Future research directions could involve enhancing the system's scalability to ensure global scene consistency, optimizing network architectures to improve throughput, and adapting the framework to support dynamic and interactive environments. These developments could pave the way for further integration of semantic mapping into intelligent robotic systems that demand high degrees of interaction with real-world objects and environments.