- The paper introduces SMNet, a novel method that projects egocentric RGB-D features into allocentric semantic maps using a multi-module neural architecture.
- It employs an egocentric visual encoder, feature projector, spatial memory tensor, and map decoder to integrate and decode rich semantic information.
- Experiments on Matterport3D demonstrate significant gains over baselines, with absolute improvements of up to 16.81% in mean IoU and 19.69% in Boundary-F1, supporting downstream embodied navigation tasks.
Overview of "Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views"
The paper "Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views" presents a novel approach to the problem of semantic mapping within the context of embodied agents, such as robots or egocentric AI assistants. This work introduces Semantic MapNet (SMNet), a method designed to construct allocentric top-down semantic maps from egocentric observations captured by an RGB-D camera with known pose. Key components of SMNet include an Egocentric Visual Encoder, a Feature Projector, a Spatial Memory Tensor, and a Map Decoder. Notably, SMNet leverages projective geometry in combination with neural representation learning to achieve significant performance improvements over existing baselines on the Matterport3D dataset, demonstrating absolute gains ranging from 4.01% to 16.81% in mean Intersection-over-Union (IoU) and 3.81% to 19.69% in Boundary-F1 metrics.
Methodology
SMNet operates through a four-module architecture (a minimal sketch of the projection and memory-update steps follows this list):
- Egocentric Visual Encoder: Uses RedNet, an efficient RGB-D semantic segmentation network, to encode each egocentric RGB-D frame into a feature map that captures semantic information about the visible scene.
- Feature Projector: Back-projects each pixel's features to its allocentric location in the spatial memory, using the pixel's depth value together with the known camera intrinsics and pose.
- Spatial Memory Tensor: Accumulates information over time; projected features are folded into the corresponding memory cells with a GRU update, storing aggregated knowledge of the explored space.
- Map Decoder: Decodes the accumulated feature-rich memory into a top-down semantic map, enabling reliable spatial understanding.
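The core of SMNet is this project-then-update step: back-project each pixel's feature with its depth into world coordinates, discretize onto the ground-plane grid, and fold the result into the memory tensor with a GRU. Below is a minimal PyTorch sketch of that step; the function names, tensor shapes, grid resolution, and simplified pinhole model are our assumptions for illustration, not the paper's released code.

```python
import torch
import torch.nn as nn

def project_features(feat, depth, K_inv, cam_to_world, map_size=250, cell=0.02):
    """Back-project per-pixel egocentric features into allocentric map cells.

    feat:         (C, H, W) feature map from the egocentric encoder
    depth:        (H, W) depth in meters
    K_inv:        (3, 3) inverse camera intrinsics
    cam_to_world: (4, 4) known camera pose
    Returns flat map-cell indices and the feature columns to write there.
    """
    C, H, W = feat.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)]).float().reshape(3, -1)
    # Rays through each pixel, scaled by depth, give camera-frame 3D points.
    pts_cam = (K_inv @ pix) * depth.reshape(1, -1)
    pts_cam = torch.cat([pts_cam, torch.ones(1, pts_cam.shape[1])], dim=0)
    pts_world = (cam_to_world @ pts_cam)[:3]            # (3, H*W)
    # Discretize the ground-plane (x, z) coordinates into map cells.
    ix = (pts_world[0] / cell).long() + map_size // 2
    iz = (pts_world[2] / cell).long() + map_size // 2
    valid = (ix >= 0) & (ix < map_size) & (iz >= 0) & (iz < map_size)
    cells = iz[valid] * map_size + ix[valid]            # flat cell indices
    return cells, feat.reshape(C, -1)[:, valid]

class SpatialMemory(nn.Module):
    """Allocentric memory tensor whose cells are updated with a GRU."""
    def __init__(self, feat_dim=64, map_size=250):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, feat_dim)
        self.register_buffer("memory", torch.zeros(map_size * map_size, feat_dim))

    def update(self, cells, feats):
        # One GRU step per observed cell: the projected features are the
        # input, the stored cell state is the hidden state. (Duplicate cell
        # indices are resolved last-write-wins in this sketch; a real
        # implementation would aggregate colliding pixels first.)
        h = self.memory[cells]
        self.memory[cells] = self.gru(feats.t(), h)
        return self.memory
```

At each timestep the agent encodes a frame, calls `project_features`, and passes the result to `SpatialMemory.update`; the Map Decoder then reads the memory tensor (reshaped to its 2D grid) to produce the semantic map.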
Experimental Evaluation
Experiments with SMNet used the Matterport3D dataset, chosen for its extensive semantic annotations and its large, multi-room indoor environments. This setting allowed precise evaluation of the model's ability to produce metric, detailed semantic maps. Evaluation metrics included overall accuracy, mean recall, mean precision, mean IoU, and Boundary-F1 (a sketch of the mean-IoU computation follows), with SMNet outperforming baseline models on the top-down semantic segmentation task.
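For reference, mean IoU over a predicted top-down map can be computed from a per-class confusion matrix as below; this helper is illustrative and not taken from the paper's evaluation code.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer class maps of identical shape."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)    # rows: truth, cols: prediction
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp  # TP + FP + FN
    iou = tp / np.maximum(union, 1)                   # guard divide-by-zero
    return iou[union > 0].mean()                      # average over observed classes
```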
Results
The experiments confirmed SMNet's robust performance, especially on small objects that are challenging to recognize and segment due to occlusion or limited egocentric viewpoints. Qualitative results showed that the method avoids the boundary labeling errors, or 'label splatter', that plague segment-then-project baselines: because SMNet projects features rather than labels or pixels, decoding happens after aggregation, which markedly reduces such splatter.
Implications and Future Work
SMNet's ability to build allocentric maps with high semantic fidelity has substantial implications for downstream tasks in embodied AI. Conceptually, the architecture advances semantic SLAM-style mapping, supporting tasks such as spatio-semantic reasoning, object-goal navigation, and question answering about the environment. Future work might add instance awareness to the semantic maps, segmenting individual objects rather than broad categories and thereby addressing failure cases where overlapping objects of the same class merge.
In sum, SMNet is a significant step toward intelligent agents that can construct and reuse rich spatial representations of their environment, and it bodes well for autonomous systems operating in complex settings.