- The paper introduces DeepLabv3+, which fuses spatial pyramid pooling with an encoder-decoder structure to significantly refine object boundaries.
- It leverages depthwise separable convolutions and a modified Xception backbone to reduce computation while maintaining high segmentation accuracy.
- Experiments on PASCAL VOC 2012 and Cityscapes demonstrate state-of-the-art mIOU results, underlining its practical utility in real-world applications.
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
"Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation," authored by Liang-Chieh Chen et al., introduces DeepLabv3+, an enhanced model for semantic image segmentation tasks. The model advances earlier versions by improving both the computational efficiency and accuracy, particularly around object boundaries, which are crucial for precise segmentation.
Key Contributions
The paper highlights several pivotal contributions:
- Combination of Spatial Pyramid Pooling (SPP) and Encoder-Decoder Structures: DeepLabv3+ combines the advantages of SPP, which captures multi-scale contextual information, with an encoder-decoder structure, which recovers spatial detail. This fusion allows the model to retain high-level semantic information while refining object boundaries (a code sketch of these components follows this list).
- Introduction of an Effective Decoder Module: The decoder in DeepLabv3+ is designed to be simple yet effective, improving upon the naive bilinear upsampling used in DeepLabv3. This module enhances segmentation performance by refining the details of object boundaries.
- Utilization of Depthwise Separable Convolutions: The model applies depthwise separable convolutions within the Atrous Spatial Pyramid Pooling (ASPP) and decoder modules. This factorization reduces computational cost while preserving performance, yielding a faster and stronger network.
- Modified Xception Backbone: The authors adapt the Xception model, making it deeper and replacing all max-pooling operations with strided depthwise separable convolutions. This enhanced backbone further boosts segmentation accuracy and computational efficiency.
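The interplay of these components is easiest to see in code. The following is a minimal PyTorch sketch of an atrous separable convolution, a simplified ASPP module, and a DeepLabv3+-style decoder; the class names, channel sizes, and layer counts are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal PyTorch sketch of the building blocks described above. Channel sizes,
# atrous rates, and layer counts are illustrative assumptions, not the authors'
# released configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AtrousSeparableConv(nn.Module):
    """Depthwise (optionally atrous) 3x3 convolution followed by a pointwise 1x1."""

    def __init__(self, in_ch, out_ch, rate=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=rate, dilation=rate,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.pointwise(self.depthwise(x))))


class SimpleASPP(nn.Module):
    """Parallel 1x1 conv, atrous separable convs at several rates, and image pooling."""

    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                           nn.BatchNorm2d(out_ch), nn.ReLU())]
            + [AtrousSeparableConv(in_ch, out_ch, rate=r) for r in rates])
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                        nn.ReLU())
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU())

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[2:],
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))


class SimpleDecoder(nn.Module):
    """Upsample ASPP output, fuse with reduced low-level features, and refine with
    separable convolutions; a final 4x bilinear upsample to input resolution
    would be applied outside this module."""

    def __init__(self, low_level_ch, num_classes, aspp_ch=256):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(low_level_ch, 48, 1, bias=False),
                                    nn.BatchNorm2d(48), nn.ReLU())
        self.refine = nn.Sequential(AtrousSeparableConv(aspp_ch + 48, 256),
                                    AtrousSeparableConv(256, 256),
                                    nn.Conv2d(256, num_classes, 1))

    def forward(self, aspp_out, low_level):
        aspp_out = F.interpolate(aspp_out, size=low_level.shape[2:],
                                 mode="bilinear", align_corners=False)
        return self.refine(torch.cat([aspp_out, self.reduce(low_level)], dim=1))
```

In a full model, the encoder backbone (for example, the modified Xception) would supply both the high-level features fed to SimpleASPP and the low-level features consumed by SimpleDecoder, with a final bilinear upsampling recovering the input resolution.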
Experimental Results
The effectiveness of the proposed model is substantiated through extensive experiments on two major datasets: PASCAL VOC 2012 and Cityscapes.
- PASCAL VOC 2012: DeepLabv3+ achieves an mIOU of 89.0% on the test set, setting a new performance benchmark without any post-processing and demonstrating the model's ability to handle diverse and complex scenes.
- Cityscapes: On the Cityscapes dataset, DeepLabv3+ attains an mIOU of 82.1% on the test set, surpassing previous state-of-the-art methods. The improvements underscore the model's robustness and adaptability to urban street scene segmentation, which is challenging due to the high variability in object sizes and shapes.
Strengths and Claims
The paper makes several notable claims supported by empirical results:
- Improved Boundary Accuracy: By integrating a decoder module that refines the segmentation map, DeepLabv3+ markedly enhances boundary accuracy, as shown by the mIOU gains in the trimap experiments, which evaluate performance within narrow bands around object boundaries.
- Computational Efficiency: Depthwise separable convolutions reduce the number of multiply-add operations, accelerating computation without compromising accuracy; the modified Xception backbone exemplifies this balance between speed and precision (a back-of-the-envelope comparison follows this list).
- Flexibility in Feature Resolution: Atrous convolution in the encoder lets the model extract feature maps at different resolutions, controlled by an output-stride setting that can be adjusted to the available computational budget. This flexibility is particularly beneficial for deployment in resource-constrained environments (see the output-stride sketch after this list).
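To make the efficiency claim concrete, the back-of-the-envelope comparison below counts multiply-adds for a single 3x3 convolution in standard and depthwise separable form; the feature-map size and channel counts are arbitrary assumptions chosen only for illustration, not figures taken from the paper.

```python
# Rough multiply-add comparison for one 3x3 convolution layer.
# The 65x65 feature map with 2048 input and 256 output channels is an
# illustrative assumption, not a configuration reported in the paper.
def standard_conv_madds(h, w, c_in, c_out, k=3):
    # Each output position applies a k*k*c_in filter for every output channel.
    return h * w * c_out * k * k * c_in

def separable_conv_madds(h, w, c_in, c_out, k=3):
    # Depthwise pass: one k*k filter per input channel.
    depthwise = h * w * c_in * k * k
    # Pointwise pass: a 1x1 convolution mixing channels.
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

h = w = 65
c_in, c_out = 2048, 256
std = standard_conv_madds(h, w, c_in, c_out)
sep = separable_conv_madds(h, w, c_in, c_out)
print(f"standard:  {std / 1e9:.2f} G mult-adds")
print(f"separable: {sep / 1e9:.2f} G mult-adds ({std / sep:.1f}x fewer)")
```

For this hypothetical layer, the separable form needs roughly 8-9x fewer multiply-adds, the kind of saving that allows part of the budget to be spent on denser feature extraction instead.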
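On the feature-resolution point, the knob is usually exposed as an output stride: the ratio of input image resolution to the encoder's final feature resolution. The sketch below reflects the convention used in the DeepLab line of work, where moving from output stride 16 to the denser output stride 8 doubles the ASPP atrous rates; the helper function and its name are assumptions for illustration.

```python
def aspp_rates_for(output_stride, base_rates=(6, 12, 18)):
    # Denser feature extraction (output stride 8) doubles the atrous rates,
    # following the convention of the DeepLab papers; other strides are not
    # covered by this sketch.
    if output_stride == 16:
        return base_rates
    if output_stride == 8:
        return tuple(2 * r for r in base_rates)
    raise ValueError("sketch only covers output strides 8 and 16")

print(aspp_rates_for(16))  # (6, 12, 18)
print(aspp_rates_for(8))   # (12, 24, 36)
```

Output stride 16 gives the faster encoder, while output stride 8 extracts denser features at higher computational cost, which is the trade-off the flexibility claim refers to.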
Implications and Future Directions
The implications of this research extend to both theoretical and practical realms:
- Theoretical Implications: The combination of SPP and encoder-decoder structures presents a robust framework for balancing contextual information and spatial resolution. It sets a precedent for future studies aiming to enhance semantic segmentation models.
- Practical Implications: DeepLabv3+ demonstrates practical utility in real-world applications such as autonomous driving and medical image analysis, where precise boundary delineation is critical. The model's efficiency also makes it suitable for deployment on mobile and embedded devices.
Conclusion
The paper "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation" significantly advances the field of semantic segmentation by presenting DeepLabv3+, a model that effectively combines SPP and encoder-decoder architectures, utilizes depthwise separable convolutions for efficiency, and adapts a powerful Xception backbone. The model's performance on benchmark datasets confirms its efficacy, and its design principles will likely inspire further innovations in semantic segmentation and related tasks.