- The paper introduces a novel deep CNN architecture that uses max-pooling indices to efficiently upsample feature maps for pixel-wise segmentation.
- It demonstrates competitive accuracy and superior boundary localization compared to FCN and DeconvNet, as shown on benchmarks like CamVid and SUN RGB-D.
- SegNet’s low memory and computational requirements enable practical real-time applications in fields such as autonomous driving and augmented reality.
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Overview
The paper "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation," authored by Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla, presents a robust deep convolutional neural network (CNN) specifically designed for semantic pixel-wise segmentation. SegNet diverges from other segmentation networks by leveraging a novel decoding process that utilizes pooling indices from the encoder stage to perform efficient and effective upsampling.
Architecture and Methodology
The SegNet architecture comprises an encoder network and a corresponding decoder network, followed by a final pixel-wise classification layer. The encoder network is topologically identical to the 13 convolutional layers of the well-established VGG16 network, with the fully connected layers omitted to retain higher-resolution feature maps and reduce the parameter count. The decoder network is the innovative aspect of SegNet: each decoder upsamples its lower-resolution input feature maps using the pooling indices computed in the max-pooling step of the corresponding encoder. This produces sparse upsampled maps, which are then convolved with trainable decoder filters to yield dense feature maps. Because the indices can be stored compactly, the upsampling step itself requires no learning and little memory.
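The index-based unpooling at the heart of this design can be illustrated with a minimal NumPy sketch. The function names and single-channel setting here are ours for illustration; in the actual network, pooling and unpooling operate on multi-channel feature maps, and each unpooled map is convolved with trainable decoder filters:

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k x k max-pooling over a 2-D map that also records the flat
    position of each maximum, mirroring SegNet's encoder bookkeeping."""
    h, w = x.shape
    ph, pw = h // k, w // k
    pooled = np.zeros((ph, pw), dtype=x.dtype)
    indices = np.zeros((ph, pw), dtype=np.int64)
    for i in range(ph):
        for j in range(pw):
            window = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            li, lj = divmod(int(np.argmax(window)), k)
            pooled[i, j] = window[li, lj]
            indices[i, j] = (i * k + li) * w + (j * k + lj)  # flat index into x
    return pooled, indices

def max_unpool(pooled, indices, out_shape):
    """Decoder-side sparse upsampling: each pooled value is written back
    to the location its maximum came from; every other entry stays zero."""
    out = np.zeros(out_shape, dtype=pooled.dtype).ravel()
    out[indices.ravel()] = pooled.ravel()
    return out.reshape(out_shape)

x = np.arange(1, 17, dtype=float).reshape(4, 4)
pooled, idx = max_pool_with_indices(x)      # pooled is [[6, 8], [14, 16]]
upsampled = max_unpool(pooled, idx, x.shape)  # sparse 4x4 map, maxima restored in place
```

Note that the unpooled map is sparse by construction, which is why SegNet follows each unpooling with learned convolutions to densify the result.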
This unique architecture allows SegNet to map low-resolution encoder feature maps to high-resolution decoder feature maps, crucial for detailed pixel-wise classification. The design is particularly optimized for scene understanding applications such as autonomous driving, ensuring memory and computational efficiency during inference.
Comparative Analysis
The paper includes a detailed comparison with other prominent architectures such as Fully Convolutional Networks (FCN) and DeconvNet. The authors construct several variants of SegNet and FCN to methodically evaluate performance trade-offs between decoder designs. Notable SegNet advantages include:
- Efficient memory usage during inference by storing max-pooling indices instead of feature maps.
- Competitive performance in segmentation accuracy, particularly in boundary delineation.
The analysis reveals that SegNet achieves superior accuracy in boundary localization due to its structured method of upsampling with pooling indices, compared to the FCN approach of learning upsampling weights.
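The scale of the memory saving is easy to estimate: the paper notes that the position of the maximum in a 2x2 pooling window can be stored in just 2 bits, versus keeping the full float32 feature map for later fusion. A back-of-the-envelope comparison (the layer dimensions below are illustrative choices, not figures from the paper's tables):

```python
# Hypothetical encoder layer: C channels at H x W resolution.
C, H, W = 64, 360, 480

# Storing the full float32 feature map for decoder-side fusion:
full_map_bytes = C * H * W * 4

# Storing only max-pooling indices: 2 bits per 2x2 window, per channel.
index_bytes = C * (H // 2) * (W // 2) * 2 / 8

print(full_map_bytes / index_bytes)  # 64x smaller for this layer
```

The exact ratio depends on layer shape and precision, but the point stands: indices are dramatically cheaper to keep around than feature maps.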
Benchmarking Results
SegNet's performance is benchmarked on two critical datasets: the CamVid road scene dataset and the SUN RGB-D indoor scene dataset. On the CamVid dataset, SegNet demonstrates high global accuracy and mean intersection over union (mIoU), outperforming both traditional machine learning approaches and other deep learning models in most classes, particularly excelling at delineating fine structures such as road signs and pedestrians.
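Mean IoU, one of the metrics reported above, is computed per class and then averaged; it can be derived from a per-class confusion matrix. A small sketch (the counts are hypothetical, not results from the paper, and it assumes every class appears at least once):

```python
import numpy as np

def mean_iou(conf):
    """Mean intersection-over-union from a K x K confusion matrix,
    where rows are ground-truth classes and columns are predictions."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)                                  # true positives per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp    # pred + gt - overlap
    return float(np.mean(tp / union))

# Toy 2-class example: each class has IoU 3 / (4 + 4 - 3) = 0.6.
conf = np.array([[3, 1],
                 [1, 3]])
print(mean_iou(conf))  # 0.6
```

Because mIoU penalizes both false positives and false negatives per class, it is stricter than global pixel accuracy, which is why boundary-accurate methods like SegNet fare comparatively well on it.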
In the SUN RGB-D dataset, a more complex indoor scene segmentation task with 37 classes, SegNet maintains competitive performance. Despite the inherent challenges posed by indoor scenes, SegNet achieves reasonable segmentation accuracy across varied object classes.
Practical Implications and Future Work
SegNet's low memory footprint and computational efficiency make it well-suited for real-time applications in autonomous driving and augmented reality. This efficiency does not come at the cost of segmentation performance, as evidenced by its competitive results in extensive benchmarks.
Future research directions may include exploring the integration of additional modalities such as depth information, further optimizing the trade-offs between computational efficiency and segmentation accuracy, and extending the architecture to support even larger and more diverse datasets.
Conclusion
The SegNet architecture presents a practical and efficient solution for semantic image segmentation, demonstrating strong performance and low computational resource requirements. Its innovative use of max-pooling indices for upsampling sets a precedent for future designs in semantic segmentation networks. By providing detailed comparative analysis and robust benchmarking, the authors pave the way for more efficient and accurate segmentation models tailored to real-world applications.