Dilated Residual Networks (1705.09914v1)

Published 28 May 2017 in cs.CV

Abstract: Convolutional networks for image classification progressively reduce resolution until the image is represented by tiny feature maps in which the spatial structure of the scene is no longer discernible. Such loss of spatial acuity can limit image classification accuracy and complicate the transfer of the model to downstream applications that require detailed scene understanding. These problems can be alleviated by dilation, which increases the resolution of output feature maps without reducing the receptive field of individual neurons. We show that dilated residual networks (DRNs) outperform their non-dilated counterparts in image classification without increasing the model's depth or complexity. We then study gridding artifacts introduced by dilation, develop an approach to removing these artifacts ("degridding"), and show that this further increases the performance of DRNs. In addition, we show that the accuracy advantage of DRNs is further magnified in downstream applications such as object localization and semantic segmentation.

Citations (1,547)


Summary

The paper introduces DRNs that replace standard subsampling with dilated convolutions to maintain high-resolution feature maps.
It demonstrates improved performance in image classification, localization, and segmentation, with metrics showing DRN-A-18 reducing top-1 error to 28.00% from ResNet-18's 30.43%.
A novel degridding technique is employed to mitigate gridding artifacts, enhancing model accuracy and applicability across various computer vision tasks.

Dilated Residual Networks

The paper "Dilated Residual Networks" proposes a convolutional neural network (CNN) design that preserves high spatial resolution in feature maps throughout the network. Traditional CNNs progressively downsample the input image, discarding spatial detail that matters both for image classification and for tasks that require detailed scene understanding, such as object localization and semantic segmentation.

Key Contributions and Methods

The central contribution of the paper is the Dilated Residual Network (DRN). Starting from the ResNet architecture, a state-of-the-art image classification model at the time of publication, the authors replace standard subsampling operations with dilated convolutions. This keeps the receptive field of each neuron unchanged while increasing the resolution of the network's output feature maps. Because the change affects only how existing layers sample their input, a DRN is no deeper or more complex than the residual network it is derived from.
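The receptive-field claim can be checked with a little arithmetic. The sketch below is an illustration and not code from the paper: it tracks the receptive field and cumulative stride of a stack of convolutions, then compares a hypothetical two-layer strided stack with a dilated replacement. The function name `receptive_field` and the layer tuples are assumptions made for this example.

```python
def receptive_field(layers):
    """Return (receptive_field, output_stride) for a stack of conv layers.

    Each layer is a (kernel_size, stride, dilation) tuple, applied in order.
    """
    rf = 1      # receptive field of one output unit, in input pixels
    jump = 1    # distance in input pixels between adjacent output units
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf, jump


# Hypothetical original stack: a stride-2 3x3 conv followed by a 3x3 conv.
strided = [(3, 2, 1), (3, 1, 1)]
# Dilated replacement: drop the stride and dilate the following layer by 2.
dilated = [(3, 1, 1), (3, 1, 2)]

print(receptive_field(strided))  # (7, 2)
print(receptive_field(dilated))  # (7, 1)
```

Both stacks see 7 input pixels per output unit, but the dilated stack produces an output at every input position instead of every second one, which is the resolution gain described above.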

To address potential issues introduced by dilation, such as gridding artifacts, the authors develop a "degridding" technique. This involves:

Elimination of early max-pooling layers,
Addition of extra convolutional layers with progressively smaller dilation factors,
Removal of residual connections in the newly added layers.
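Why smaller dilation factors help can be seen in a toy one-dimensional case. The sketch below is a simplified illustration, not the paper's architecture: a dilation-2 filter applied to a single active input produces a response with holes (a gridding pattern), and a following dilation-1 filter fills them in. The helper `dilated_conv1d` and the all-ones kernel are assumptions chosen for clarity.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D cross-correlation with a dilated kernel."""
    half = (len(kernel) // 2) * dilation
    out = []
    for i in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = i - half + k * dilation   # input index sampled by tap k
            if 0 <= j < len(signal):
                total += weight * signal[j]
        out.append(total)
    return out


impulse = [0.0] * 11
impulse[5] = 1.0            # a single active input location
box = [1.0, 1.0, 1.0]       # illustrative kernel, not learned weights

gridded = dilated_conv1d(impulse, box, dilation=2)
smoothed = dilated_conv1d(gridded, box, dilation=1)

print(gridded)   # nonzero only at indices 3, 5, 7: holes at 4 and 6
print(smoothed)  # nonzero at every index from 2 through 8
```

In this toy setting the dilation-1 pass removes the holes because its taps sample adjacent positions, which the dilation-2 filter never does.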

Empirical Results

Image Classification

DRNs demonstrate superior performance in image classification tasks:

ImageNet Classification: DRNs consistently outperform their non-dilated counterparts. For instance, DRN-A-18 yields a top-1 error rate of 28.00% in the 1-crop evaluation, compared to ResNet-18's 30.43%.
The degridded variant, DRN-C-26, outperforms the deeper ResNet-34, with a top-1 error rate of 24.86% against ResNet-34's 27.73%.

Object Localization

In weakly-supervised object localization, the high-resolution activation maps provided by DRNs offer significant advantages:

Localization Accuracy: DRN-A models localize objects more accurately than their ResNet counterparts. For example, DRN-A-50 achieves better top-1 and top-5 localization results than ResNet-50.
The degridded DRN-C-26 model, in particular, outperforms deeper models like DRN-A-50 and ResNet-101 in top-1 localization error.
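One common way to turn a class activation map into a box is to threshold the map and take the tightest enclosing rectangle. The sketch below assumes that procedure; the paper's exact rule may differ, and the function name, threshold value, and toy map are all hypothetical.

```python
def bounding_box(activation_map, threshold):
    """Tightest (row_min, col_min, row_max, col_max) box around cells > threshold.

    Returns None when no cell exceeds the threshold.
    """
    hits = [
        (r, c)
        for r, row in enumerate(activation_map)
        for c, value in enumerate(row)
        if value > threshold
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return min(rows), min(cols), max(rows), max(cols)


# Toy 5x6 activation map for one class (made-up values).
toy_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.7, 0.8, 0.1, 0.0],
    [0.0, 0.3, 0.9, 0.6, 0.2, 0.0],
    [0.0, 0.1, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]

print(bounding_box(toy_map, threshold=0.5))  # (1, 2, 3, 3)
```

The box is expressed in map cells, so its precision in image pixels depends on the output stride: a cell spans 32 input pixels at output stride 32 but only 8 at output stride 8. This is one way to see why higher-resolution activation maps help localization.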

Semantic Segmentation

For semantic segmentation, which demands detailed pixel-level predictions:

Cityscapes Dataset: DRNs perform well, especially the degridded variants. For instance, DRN-C-42 reaches a mean Intersection over Union (IoU) of 70.9%, compared with 66.6% for the ResNet-101 baseline.
The presented models show clear qualitative and quantitative improvements, with cleaner and more accurate segmentations.
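For readers unfamiliar with the metric, mean IoU averages, over classes, the ratio of correctly labeled pixels to all pixels assigned to that class by either the prediction or the ground truth. The sketch below is a minimal version on flat label lists; benchmark-specific details such as ignored labels and dataset-level accumulation are omitted, and the function name and toy labels are assumptions.

```python
def mean_iou(predicted, target, num_classes):
    """Mean intersection over union across classes present in either labeling.

    `predicted` and `target` are equal-length flat sequences of class indices.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    ious = []
    for cls in range(num_classes):
        pairs = list(zip(predicted, target))
        intersection = sum(1 for p, t in pairs if p == cls and t == cls)
        union = sum(1 for p, t in pairs if p == cls or t == cls)
        if union > 0:          # skip classes absent from both labelings
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy labels for six pixels and three classes (made-up values).
target = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]

# Per-class IoU is 1/3, 2/3 and 1/2, so the mean is about 0.5.
print(mean_iou(predicted, target, num_classes=3))
```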

Implications and Future Directions

The DRN architecture holds significant implications for numerous computer vision tasks:

Theoretical Insights: The work underscores the importance of maintaining high spatial resolution in feature maps for improved accuracy in image classification and understanding.
Practical Applications: DRNs can be directly applied to multiple downstream tasks without substantial modifications, simplifying the adaptation process for applications such as object localization and segmentation.

Further research could explore optimizing the DRN architecture for other vision tasks, such as depth estimation and video analysis. Additionally, integrating DRNs with other advanced structural designs (e.g., attention mechanisms) might yield further performance enhancements.

Conclusion

The paper shows that Dilated Residual Networks preserve high spatial resolution without added depth or complexity and that this improves performance across classification, localization, and segmentation. The empirical results support the use of DRNs for detailed scene analysis and provide a basis for further work on high-resolution architectures.
