DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs (1606.00915v2)

Published 2 Jun 2016 in cs.CV

Abstract: In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.

Citations (16,930)

Summary

The paper highlights atrous convolution and proposes ASPP to capture multi-scale context while preserving high-resolution feature maps for segmentation.
It reports state-of-the-art performance at the time of publication, notably 79.7% mIOU on the PASCAL VOC 2012 test set, by combining deep CNNs with fully connected CRFs.
The approach refines object boundaries effectively and lays the groundwork for future joint training of DCNNs and CRFs in end-to-end systems.

The paper "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs," authored by Chen et al., presents a framework for semantic image segmentation with deep learning. Its three main contributions are the use of atrous convolution for dense prediction, the proposal of atrous spatial pyramid pooling (ASPP), and the integration of fully connected Conditional Random Fields (CRFs) to improve object boundary localization. This essay summarizes the paper and discusses its practical and theoretical implications, as well as directions for future work.

Technical Contributions

Atrous Convolution: Atrous convolution, also known as dilated convolution, is highlighted as a way to maintain high-resolution feature maps throughout Deep Convolutional Neural Networks (DCNNs). The filter is upsampled by inserting zeros between its weights, which gives explicit control over the resolution at which feature responses are computed. It also enlarges the filter's field of view without increasing the number of parameters or the amount of computation.
Atrous Spatial Pyramid Pooling (ASPP): ASPP addresses the challenge of recognizing objects at multiple scales. Traditional approaches present multiple rescaled versions of an image to a network, which can be computationally expensive. ASPP instead applies several atrous convolutions with different sampling rates in parallel to the same feature layer. The parallel branches capture objects and image context at multiple scales, making the segmentation network more robust.
Fully Connected CRFs: To mitigate the loss of localization accuracy caused by the invariance properties of DCNNs (a side effect of max-pooling and downsampling), the paper applies a fully connected CRF as a post-processing step. The CRF uses both appearance and spatial information to refine segmentation boundaries, and relies on a mean-field approximation for efficient inference.
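
As a minimal sketch of the first two ideas, assuming a 1-D signal in place of a 2-D feature map, the pure-Python code below implements atrous convolution by sampling the input every `rate` positions, which is equivalent to inserting `rate - 1` zeros between filter taps. The function names (`effective_kernel_size`, `atrous_conv1d`, `aspp_1d`) are illustrative and are not taken from the paper's released code. The toy ASPP fuses its parallel branches by summation and uses zero padding; both are simplifying assumptions of this sketch.

```python
def effective_kernel_size(k, rate):
    """Input span covered by a k-tap filter with (rate - 1) zeros between taps."""
    return k + (k - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1-D atrous convolution (cross-correlation form, no kernel flip)."""
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    extent = effective_kernel_size(len(kernel), rate)
    # Step through the input in strides of `rate` instead of building
    # the zero-filled filter explicitly; the result is the same.
    return [
        sum(w * signal[i + j * rate] for j, w in enumerate(kernel))
        for i in range(len(signal) - extent + 1)
    ]


def aspp_1d(signal, kernels, rates):
    """Toy ASPP: one odd-length kernel per rate, branches run in parallel and summed."""
    fused = [0.0] * len(signal)
    for kernel, rate in zip(kernels, rates):
        if len(kernel) % 2 == 0:
            raise ValueError("this sketch assumes odd-length kernels")
        # Zero padding chosen so every branch keeps the input length.
        pad = (len(kernel) - 1) * rate // 2
        padded = [0.0] * pad + list(signal) + [0.0] * pad
        branch = atrous_conv1d(padded, kernel, rate)
        fused = [f + b for f, b in zip(fused, branch)]
    return fused


if __name__ == "__main__":
    ramp = [1, 2, 3, 4, 5, 6, 7]
    print(atrous_conv1d(ramp, [1, 0, -1], rate=1))  # differences 2 samples apart
    print(atrous_conv1d(ramp, [1, 0, -1], rate=2))  # differences 4 samples apart
    print(effective_kernel_size(3, 12))             # span of a 3-tap filter at rate 12
```

The same three weights span 3 input positions at rate 1, 5 at rate 2, and 25 at rate 12, which is how the field of view grows while the parameter count stays fixed.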

Numerical Results

The DeepLab system demonstrates state-of-the-art performance across multiple datasets:

PASCAL VOC 2012: DeepLab achieves a mean Intersection over Union (mIOU) of 79.7% on the test set. This performance marks a significant improvement over earlier approaches and reflects the efficacy of atrous convolution and fully connected CRFs.
PASCAL-Context: The system achieves an mIOU of 45.7%, surpassing previous methods.
PASCAL-Person-Part: DeepLab attains an mIOU of 64.94%, outperforming contemporary approaches by a considerable margin.
Cityscapes: With an mIOU of 70.4%, DeepLab exhibits strong performance in urban scene understanding.
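
For readers unfamiliar with the metric, the sketch below shows one common way to compute mIOU from flat lists of per-pixel class labels. It is not the benchmarks' official evaluation code. The function name `mean_iou`, the `ignore_label=255` default for unlabeled pixels, and the choice to skip classes absent from both prediction and ground truth are assumptions of this example.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean over classes of TP / (TP + FP + FN), from flat per-pixel label lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = [0] * num_classes  # true positives per class
    union = [0] * num_classes         # TP + FP + FN per class
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # assumption: unlabeled ground-truth pixels are not scored
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[t] += 1  # false negative for the true class
            if 0 <= p < num_classes:
                union[p] += 1  # false positive for the predicted class
    # Assumption: classes that never appear are left out of the average.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no scored pixels")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    prediction = [0, 0, 1, 1]
    ground_truth = [0, 1, 1, 1]
    # Class 0: IoU = 1/2. Class 1: IoU = 2/3.
    print(mean_iou(prediction, ground_truth, num_classes=2))
```

Because each class contributes equally to the average, a small class that is segmented poorly lowers mIOU as much as a large one, which is why the metric is stricter than plain pixel accuracy.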

Practical and Theoretical Implications

The successful application of atrous convolution shows that DCNN architectures designed for image classification can be adapted to dense prediction tasks without substantial computational penalties. ASPP addresses the pervasive problem of scale variation in segmentation, since objects of the same class can appear at very different sizes. Combining these techniques with a fully connected CRF yields high-resolution predictions with more precise object boundary delineation.
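
To make the CRF's role concrete, the block below sketches the energy function of a fully connected CRF in the standard dense-CRF notation. The symbols are introduced here for illustration: $x_i$ is the label of pixel $i$, $P(x_i)$ the DCNN's label probability at that pixel, $p_i$ the pixel position, $I_i$ its color, and $w_1, w_2, \sigma_\alpha, \sigma_\beta, \sigma_\gamma$ hyperparameters.

```latex
\begin{align}
E(x) &= \sum_i \theta_i(x_i) + \sum_{i<j} \theta_{ij}(x_i, x_j),
\qquad \theta_i(x_i) = -\log P(x_i), \\
\theta_{ij}(x_i, x_j) &= \mu(x_i, x_j) \left[
  w_1 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2}
                   -\frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2} \right)
+ w_2 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2} \right)
\right], \\
\mu(x_i, x_j) &= \begin{cases} 1 & \text{if } x_i \neq x_j, \\ 0 & \text{otherwise.} \end{cases}
\end{align}
```

The first kernel penalizes assigning different labels to nearby pixels of similar color, which is what sharpens boundaries; the second depends only on position and enforces spatial smoothness.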

Future Directions

The results and methodologies proposed in this paper suggest several avenues for future research:

Joint Training of DCNNs and CRFs: While this work applies CRFs in a decoupled manner, further improvements could be realized by jointly training these components to facilitate end-to-end learning.
Enhanced Context Modules: The ASPP module could be expanded with additional context aggregation techniques, potentially improving the ability to handle diverse and complex visual scenes.
Adapting to Diverse Datasets: Future work could evaluate the generalizability of the DeepLab framework across a broader array of datasets, including those with less structured environments than those considered.

Conclusion

The paper "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs" represents a substantial advancement in the field of semantic segmentation. By integrating atrous convolution, ASPP, and fully connected CRFs, the authors have developed a robust system that significantly outperforms prior methods on several benchmarks. The architectural innovations and empirical results laid out in this paper provide a solid foundation for further exploration and enhancement in semantic image segmentation and related areas in computer vision.
