
Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis (2011.06961v3)

Published 13 Nov 2020 in cs.CV, cs.LG, and cs.RO

Abstract: Analyzing scenes thoroughly is crucial for mobile robots acting in different environments. Semantic segmentation can enhance various subsequent tasks, such as (semantically assisted) person perception, (semantic) free space detection, (semantic) mapping, and (semantic) navigation. In this paper, we propose an efficient and robust RGB-D segmentation approach that can be optimized to a high degree using NVIDIA TensorRT and, thus, is well suited as a common initial processing step in a complex system for scene analysis on mobile robots. We show that RGB-D segmentation is superior to processing RGB images solely and that it can still be performed in real time if the network architecture is carefully designed. We evaluate our proposed Efficient Scene Analysis Network (ESANet) on the common indoor datasets NYUv2 and SUNRGB-D and show that we reach state-of-the-art performance while enabling faster inference. Furthermore, our evaluation on the outdoor dataset Cityscapes shows that our approach is suitable for other areas of application as well. Finally, instead of presenting benchmark results only, we also show qualitative results in one of our indoor application scenarios.

Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis

The paper "Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis" introduces a novel framework for RGB-D semantic segmentation tailored for efficient real-time performance on mobile robotic platforms. The proposed method, named the Efficient Scene Analysis Network (ESANet), leverages both RGB and depth data inputs for enhanced indoor scene understanding, addressing the complexity of multi-modal data while optimizing computational efficiency.

The authors emphasize the advantages of incorporating depth information alongside traditional RGB data, particularly in cluttered indoor environments where RGB data might fall short due to lack of depth cues. The ESANet employs a dual encoder-decoder architecture with specifically designed branches for processing RGB and depth data separately, followed by a strategic fusion mechanism. The depth data provide complementary geometric information that enhances feature representation when combined with RGB data.
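The fusion idea can be sketched in a few lines. The snippet below is a minimal NumPy illustration, assuming a squeeze-and-excitation-style channel reweighting before element-wise addition; the function names and weight shapes are hypothetical and the actual ESANet fusion modules differ in detail.

```python
import numpy as np

def squeeze_excite(features, fc1, fc2):
    """Channel attention: global average pool -> two FC layers -> sigmoid gate.

    features: array of shape (C, H, W); fc1: (R, C); fc2: (C, R).
    """
    z = features.mean(axis=(1, 2))            # squeeze to per-channel statistics (C,)
    s = np.maximum(fc1 @ z, 0.0)              # reduce + ReLU (R,)
    w = 1.0 / (1.0 + np.exp(-(fc2 @ s)))      # expand + sigmoid gate (C,)
    return features * w[:, None, None]        # reweight each channel

def fuse_rgbd(rgb_feat, depth_feat, fc_r1, fc_r2, fc_d1, fc_d2):
    """Reweight each modality per channel, then merge by element-wise addition."""
    return (squeeze_excite(rgb_feat, fc_r1, fc_r2)
            + squeeze_excite(depth_feat, fc_d1, fc_d2))
```

The gating lets the network suppress a modality per channel, e.g. down-weighting depth features where the sensor returns noisy measurements, before the merged features continue through the shared decoder path.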

A distinguishing aspect of ESANet is its encoder design, which is based on the ResNet architecture. However, to improve efficiency without sacrificing performance, the authors replace the conventional ResNet basic blocks with Non-Bottleneck-1D (NBt1D) blocks, which utilize factorized convolutions to reduce computational load. This modification supports faster inference while maintaining or improving accuracy. The network also includes a context module inspired by PSPNet for aggregating multi-scale feature information, an addition that positively impacts network performance by increasing the receptive field.
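The efficiency gain of the factorized blocks is easy to quantify. The sketch below counts weights for a standard 3x3 convolution versus the 3x1 + 1x3 pair used in a Non-Bottleneck-1D block; the helper names are illustrative, and a full NBt1D block additionally stacks two such pairs with batch norm, ReLU, and a residual connection.

```python
def conv_params(c_in, c_out, kh, kw, bias=True):
    """Number of learnable weights in a standard 2-D convolution layer."""
    return c_out * (c_in * kh * kw + (1 if bias else 0))

def full3x3_params(channels):
    """Single 3x3 convolution keeping the channel count."""
    return conv_params(channels, channels, 3, 3)

def nbt1d_pair_params(channels):
    """Factorized pair: 3x1 followed by 1x3, keeping the channel count."""
    return (conv_params(channels, channels, 3, 1)
            + conv_params(channels, channels, 1, 3))
```

For 64 channels this gives 36,928 weights for the full 3x3 kernel versus 24,704 for the factorized pair, roughly a one-third reduction in parameters (and a similar reduction in multiply-accumulates) per replaced convolution.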

In the decoder phase, ESANet incorporates a learned upsampling technique that outperforms traditional bilinear interpolation and avoids artifacts introduced by transposed convolutions. This is achieved by initializing the weights to replicate bilinear interpolation and allowing the network to tune these parameters during training.
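A common way to realize such an initialization, shown below as a hedged NumPy sketch, is to build the classic bilinear interpolation kernel (as popularized by FCN-style decoders) and copy it into the upsampling layer's weights before training; ESANet's exact layer layout may differ.

```python
import numpy as np

def bilinear_kernel(size):
    """2-D bilinear interpolation kernel of shape (size, size).

    Initializing learned upsampling weights with this kernel makes the
    layer start out as plain bilinear interpolation, which training can
    then refine instead of learning upsampling from scratch.
    """
    factor = (size + 1) // 2
    # Kernel center: integer for odd sizes, half-integer for even sizes.
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return ((1 - np.abs(rows - center) / factor)
            * (1 - np.abs(cols - center) / factor))
```

A 4x4 kernel of this form paired with stride-2 upsampling reproduces bilinear interpolation exactly at initialization, so the network loses nothing relative to the fixed-interpolation baseline and can only improve on it.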

Through extensive evaluation on commonly used datasets, NYUv2 and SUNRGB-D, ESANet achieves state-of-the-art segmentation performance. It balances accuracy with efficiency, demonstrating real-time inference capabilities on embedded hardware, such as the NVIDIA Jetson AGX Xavier, achieving speeds significantly higher than competing approaches while maintaining comparable or superior mean intersection over union (mIoU) scores.

Notably, ESANet also extends successfully to the Cityscapes dataset, demonstrating its versatility across different environmental contexts. There, depth is derived from computed stereo disparities rather than an active depth sensor, illustrating that the approach generalizes to outdoor scenarios despite the differing fidelity of the depth data.

The paper also presents an ablation study dissecting the contribution of each architectural component, including the depth incorporation strategy, the context module, skip connections, and the proposed upsampling, reinforcing the rationale behind ESANet's design choices.

The practical implications of ESANet are far-reaching for mobile robotics in complex indoor environments. Integrating ESANet into a robotic system allows tasks such as obstacle detection, semantic mapping, and object perception to build on its enriched segmentation outputs more effectively and efficiently. The framework also paves the way for further research on optimizing semantic segmentation models for applications with stringent computational constraints, such as mobile robots and autonomous systems. As AI continues to integrate with robotics, ESANet marks a significant step toward context-aware robotic systems capable of real-time environmental interpretation.

Authors (5)
  1. Daniel Seichter (5 papers)
  2. Mona Köhler (5 papers)
  3. Benjamin Lewandowski (2 papers)
  4. Tim Wengefeld (2 papers)
  5. Horst-Michael Gross (17 papers)
Citations (191)