Understanding Convolution for Semantic Segmentation (1702.08502v3)

Published 27 Feb 2017 in cs.CV

Abstract: Recent advances in deep learning, especially deep convolutional neural networks (CNNs), have led to significant improvement over previous semantic segmentation systems. Here we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are of both theoretical and practical value. First, we design dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields (RF) of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a state-of-art result of 80.1% mIOU in the test set at the time of submission. We also have achieved state-of-the-art overall on the KITTI road estimation benchmark and the PASCAL VOC2012 segmentation task. Our source code can be found at https://github.com/TuSimple/TuSimple-DUC .

Authors (7)

Panqu Wang
Pengfei Chen
Ye Yuan
Ding Liu
Zehua Huang
Xiaodi Hou
Garrison Cottrell

Summary

The paper presents Dense Upsampling Convolution (DUC) to learn upscaling filters that enhance fine-grained pixel predictions compared to bilinear upsampling.
It introduces Hybrid Dilated Convolution (HDC) to alleviate the gridding effect of standard dilated convolution and to give the receptive field dense, gap-free coverage.
Experimental results show state-of-the-art mIoU scores on Cityscapes and PASCAL VOC2012, improving segmentation accuracy for small objects.

Understanding Convolution for Semantic Segmentation

Overview

The paper "Understanding Convolution for Semantic Segmentation" by Wang et al. presents two changes to convolution-related operations for semantic segmentation: Dense Upsampling Convolution (DUC) in the decoding phase and Hybrid Dilated Convolution (HDC) in the encoding phase. Both aim to improve the accuracy of pixel-level semantic segmentation, which is critical in applications such as autonomous driving and image understanding.

Dense Upsampling Convolution (DUC)

DUC addresses the limitations of bilinear upsampling, which is commonly used in semantic segmentation systems. Bilinear upsampling is computationally inexpensive but not learnable, and it often fails to recover the fine details needed for accurate pixel-wise prediction. Inspired by techniques from image super-resolution, DUC instead applies convolution directly to the low-resolution feature maps and learns a set of upscaling filters: the label map is divided into subparts with the same spatial size as the feature map, each subpart is predicted by its own group of output channels, and the channels are then rearranged into a dense, full-resolution pixel-wise prediction map. DUC is end-to-end trainable within the Fully Convolutional Network (FCN) framework and recovers detail that bilinear upsampling loses, which is especially beneficial for small objects.
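The rearrangement step at the end of DUC can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: it assumes a preceding convolution has already produced d*d*L score channels at the low resolution h x w, where d is the downsampling factor and L the number of classes. The function name `duc_rearrange` and the channel ordering are hypothetical choices made for this example, and the learned convolution itself is omitted.

```python
def duc_rearrange(scores, d, num_classes):
    """Rearrange low-resolution scores [d*d*L][h][w] into [L][h*d][w*d].

    Assumed (illustrative) channel layout:
        channel index = (dy * d + dx) * num_classes + c
    so each of the d*d sub-positions owns one block of L class channels.
    """
    if len(scores) != d * d * num_classes:
        raise ValueError("expected d*d*num_classes input channels")
    h, w = len(scores[0]), len(scores[0][0])

    # Allocate the full-resolution output: one (h*d) x (w*d) plane per class.
    out = [[[0.0] * (w * d) for _ in range(h * d)] for _ in range(num_classes)]

    for dy in range(d):
        for dx in range(d):
            for c in range(num_classes):
                plane = scores[(dy * d + dx) * num_classes + c]
                for y in range(h):
                    for x in range(w):
                        # Low-res cell (y, x) fills one pixel of its d x d block.
                        out[c][y * d + dy][x * d + dx] = plane[y][x]
    return out


# Usage: one class, d = 2, a single low-resolution cell with four channels.
low_res = [[[1]], [[2]], [[3]], [[4]]]
print(duc_rearrange(low_res, d=2, num_classes=1))  # [[[1, 2], [3, 4]]]
```

Because every output pixel is read from its own learned channel, no interpolation is involved; in a deep learning framework the same rearrangement is typically a single reshape or depth-to-space operation.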

Hybrid Dilated Convolution (HDC)

HDC addresses the "gridding" problem inherent in standard dilated convolution. A dilated kernel samples its input sparsely, so when consecutive layers use the same dilation rate, each output depends on only a checkerboard-like subset of input positions; local information is lost, and the positions that are sampled may be too far apart to be consistent with one another. HDC instead assigns different dilation rates to consecutive layers at the same spatial resolution, so that the combined receptive field covers the area without gaps. This enlarges the receptive field without adding extra modules and improves the network's ability to recognize larger objects while preserving local detail.
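The gridding effect can be checked with a small one-dimensional sketch. The code below is an illustration under simplifying assumptions (1-D, 3-tap kernels, stride 1), not the paper's code: it enumerates which input offsets can influence a single output after stacking dilated convolutions. The helper names `contributing_offsets` and `has_gaps`, and the example dilation rates, are chosen for this example.

```python
def contributing_offsets(dilation_rates, kernel_size=3):
    """Input offsets (relative to the output position) that can reach one
    output after stacking 1-D dilated convolutions with the given rates."""
    half = kernel_size // 2
    offsets = {0}
    for rate in dilation_rates:
        # A dilated kernel taps positions -half*rate, ..., 0, ..., +half*rate.
        taps = [k * rate for k in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return offsets


def has_gaps(dilation_rates, kernel_size=3):
    """True if some position inside the receptive field is never sampled."""
    offsets = contributing_offsets(dilation_rates, kernel_size)
    full_span = set(range(min(offsets), max(offsets) + 1))
    return offsets != full_span


# Same rate in every layer: only even offsets are ever seen (gridding).
print(sorted(contributing_offsets([2, 2, 2])))  # [-6, -4, -2, 0, 2, 4, 6]
print(has_gaps([2, 2, 2]))                      # True

# Mixed rates: same span of 13 positions, but every one is covered.
print(has_gaps([1, 2, 3]))                      # False
```

Both configurations span offsets -6 to 6, so mixing the rates costs nothing in receptive field size; it only fills the holes that a repeated rate leaves behind.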

Experimental Results

The proposed methods were evaluated extensively on several datasets, including Cityscapes, KITTI road segmentation, and PASCAL VOC2012. The combined DUC and HDC approach achieved the following:

Cityscapes Dataset: Achieved a mean Intersection-over-Union (mIoU) of 80.1% on the test set, the state of the art at the time of submission. The results show that the proposed methods outperform the baselines and other recent methods, particularly in identifying small objects and preserving fine details.
KITTI Road Segmentation: Attained state-of-the-art performance with the highest maximum F1-measure and average precision across multiple road scene categories, despite the limited training data.
PASCAL VOC2012: Achieved an mIoU of 83.1% on the test set using a single model without model ensemble or multiscale testing, highlighting the method's robustness and generalizability.
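For readers unfamiliar with the metric reported above, mIoU averages the per-class ratio of intersection to union between predicted and ground-truth label sets. The sketch below is a minimal reference computation over flattened label lists. It is not the official Cityscapes or PASCAL VOC evaluation code: it assumes that classes absent from both prediction and ground truth are skipped and that there are no "ignore" labels, and the function name `mean_iou` is chosen for this example.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over flattened integer label lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Correct pixel: counts once in both intersection and union.
            inter[p] += 1
            union[p] += 1
        else:
            # Wrong pixel: enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Skip classes that never occur in either prediction or ground truth.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    if not ious:
        raise ValueError("no class present in pred or target")
    return sum(ious) / len(ious)


# Usage: class 0 has IoU 1/2, class 1 has IoU 2/3, so mIoU = 7/12.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Benchmark servers accumulate intersections and unions over the whole test set before dividing, which this function also does when the full set of pixels is passed in one call.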

Theoretical and Practical Implications

Taken together, DUC and HDC address both ends of a segmentation network: HDC on the encoding side and DUC on the decoding side. Theoretically, they offer a perspective on the trade-off between receptive field size and resolution. Practically, both modify existing convolution-related operations rather than adding new modules, so they can be applied to other segmentation tasks with minimal adjustments.

Future Developments

Future research could explore further refinements of DUC and HDC, including their integration with other advanced network architectures and techniques. Additionally, extending these methods to three-dimensional data and exploring their applications in medical imaging and other domains could be highly beneficial.

In summary, the paper contributes two convolution-level techniques, DUC and HDC, that improve on bilinear upsampling and standard dilated convolution respectively, and that together yield state-of-the-art results on the benchmarks above.
