The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation (1611.09326v3)

Published 28 Nov 2016 in cs.CV

Abstract: State-of-the-art approaches for semantic image segmentation are built on Convolutional Neural Networks (CNNs). The typical segmentation architecture is composed of (a) a downsampling path responsible for extracting coarse semantic features, followed by (b) an upsampling path trained to recover the input image resolution at the output of the model and, optionally, (c) a post-processing module (e.g. Conditional Random Fields) to refine the model predictions. Recently, a new CNN architecture, Densely Connected Convolutional Networks (DenseNets), has shown excellent results on image classification tasks. The idea of DenseNets is based on the observation that if each layer is directly connected to every other layer in a feed-forward fashion then the network will be more accurate and easier to train. In this paper, we extend DenseNets to deal with the problem of semantic segmentation. We achieve state-of-the-art results on urban scene benchmark datasets such as CamVid and Gatech, without any further post-processing module nor pretraining. Moreover, due to smart construction of the model, our approach has much less parameters than currently published best entries for these datasets. Code to reproduce the experiments is available here : https://github.com/SimJeg/FC-DenseNet/blob/master/train.py

Authors (5)

Simon Jégou
Michal Drozdzal
David Vazquez
Adriana Romero
Yoshua Bengio

Citations (1,541)


Summary

The paper, titled "The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation," explores the application of Densely Connected Convolutional Networks (DenseNets) for the task of semantic segmentation. The authors, Simon Jégou et al., extend the DenseNet architecture, initially successful in image classification, to semantic segmentation, which involves classifying each pixel of an image into a predefined class.

Contributions

The contributions of this paper can be summarized as follows:

Extension of DenseNet to Fully Convolutional Networks (FCNs) for Semantic Segmentation: The paper adapts DenseNet to an FCN by adding an upsampling path, which recovers the spatial resolution lost to the pooling layers of the downsampling path.
Upsampling Path Construction to Mitigate Feature Map Explosion: The authors build the upsampling path from dense blocks but upsample only the feature maps created by the preceding dense block. This stops the number of feature maps from accumulating linearly across the upsampling path and keeps memory and computation tractable, especially at full resolution.
State-of-the-Art Performance with Significantly Fewer Parameters: The proposed models, particularly FC-DenseNet103, outperform existing methods on standard urban scene segmentation benchmarks (CamVid, Gatech) without requiring pretraining or post-processing modules. The network achieves these results with substantially fewer parameters than other state-of-the-art models.

Architecture Overview

DenseNet Recap

DenseNets exploit dense connectivity among layers. Each layer receives inputs from all preceding layers and passes on its feature maps to all subsequent layers, resulting in improved parameter efficiency, implicit deep supervision, and feature reuse across the architecture.
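The connectivity pattern described above can be illustrated with a minimal NumPy sketch. It is a toy, not the authors' implementation: each "layer" is a random 1x1 convolution followed by ReLU, standing in for the real convolutional layers, and the names `dense_block`, `n_layers`, and `growth_rate` as well as the sizes used are illustrative assumptions.

```python
import numpy as np

def dense_block(x, n_layers, growth_rate, rng):
    """Toy dense block on a (channels, height, width) array.

    Assumption: a random 1x1 convolution + ReLU stands in for the real
    convolutional layers of a DenseNet.
    """
    features = [x]
    for _ in range(n_layers):
        # Every layer sees the block input and all earlier layer outputs.
        stacked = np.concatenate(features, axis=0)
        weights = rng.standard_normal((growth_rate, stacked.shape[0]))
        # 1x1 convolution = per-pixel linear mix over channels, then ReLU.
        new_maps = np.maximum(np.einsum("oc,chw->ohw", weights, stacked), 0.0)
        features.append(new_maps)
    # The block output concatenates the input with every layer's new maps.
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((48, 8, 8))
out = dense_block(x, n_layers=4, growth_rate=16, rng=rng)
print(out.shape)  # (112, 8, 8): 48 input maps + 4 layers * 16 new maps
```

Each layer adds only `growth_rate` new feature maps, so the channel count grows linearly with the number of layers in the block.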

Fully Convolutional DenseNets (FC-DenseNets)

In FC-DenseNets, the core DenseNet architecture forms the downsampling path. This path reduces spatial resolution through pooling (in transition down modules) while the number of feature maps grows through successive dense blocks. The upsampling path uses a transition up module, built around a transposed convolution, to scale feature maps back toward the original input resolution. To avoid an explosion in the number of feature maps, each transition up upsamples only the feature maps produced by the immediately preceding dense block, not that block's input. Skip connections concatenate feature maps from the downsampling path with the upsampled feature maps at the same resolution, so that spatially detailed information is preserved.
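The effect of upsampling only the newly created feature maps can be checked with simple bookkeeping. The sketch below traces channel counts through the network and counts layers. The hyperparameters (48 filters in the first convolution, growth rate 16, dense blocks of 4, 5, 7, 10, 12 layers, a 15-layer bottleneck) are not given in this summary; they are assumed here as the FC-DenseNet103 configuration and should be checked against the paper. The function names and the `upsample_all` flag are hypothetical.

```python
def trace_channels(first_conv, growth_rate, down_blocks, bottleneck, up_blocks,
                   upsample_all=False):
    """Return the feature-map count after each dense block (input + new maps)."""
    channels = first_conv
    skips, trace = [], []
    for n_layers in down_blocks:
        channels += n_layers * growth_rate   # block output = input + new maps
        skips.append(channels)               # saved for the skip connection
        trace.append(channels)               # pooling keeps the channel count
    new_maps = bottleneck * growth_rate
    channels += new_maps
    trace.append(channels)
    for n_layers in up_blocks:
        # Proposed scheme: upsample only the maps the last block created.
        carried = channels if upsample_all else new_maps
        block_input = carried + skips.pop()  # concatenate with the skip
        new_maps = n_layers * growth_rate
        channels = block_input + new_maps
        trace.append(channels)
    return trace

def count_layers(down_blocks, bottleneck, up_blocks):
    """First conv + block layers + one layer per transition + final 1x1 conv."""
    in_blocks = sum(down_blocks) + bottleneck + sum(up_blocks)
    transitions = len(down_blocks) + len(up_blocks)
    return 1 + in_blocks + transitions + 1

# Assumed FC-DenseNet103 configuration (verify against the paper).
down, mid, up = [4, 5, 7, 10, 12], 15, [12, 10, 7, 5, 4]
proposed = trace_channels(48, 16, down, mid, up)
naive = trace_channels(48, 16, down, mid, up, upsample_all=True)
print(proposed[-1], naive[-1])       # 256 3232
print(count_layers(down, mid, up))   # 103
```

Under these assumptions the last dense block works with 256 feature maps at full resolution, whereas carrying every feature map through the upsampling path would leave 3232.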

Experimental Evaluation

CamVid Dataset

Three FC-DenseNet models (56, 67, and 103 layers) were evaluated on the CamVid dataset. FC-DenseNet103 yielded a mean Intersection-over-Union (IoU) of 66.9% and a global accuracy of 91.5%, surpassing existing architectures such as FCN8, DeepLab, and Dilation8. Notably, the FC-DenseNet variants improved per-class performance on under-represented classes in CamVid, which suggests that the models cope well with class imbalance.
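For readers unfamiliar with the two metrics, here is a minimal sketch of how mean IoU and global accuracy can be computed from a confusion matrix. It is not the paper's evaluation code: the function name and the toy labels are illustrative, and it does not handle an ignored "void" label, which a real CamVid evaluation would need to mask out.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred, n_classes):
    """Return (mean IoU, global accuracy) for integer label maps."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Confusion matrix: rows are true classes, columns are predicted classes.
    conf = np.bincount(y_true * n_classes + y_pred,
                       minlength=n_classes ** 2).reshape(n_classes, n_classes)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp   # predicted as class c but actually another
    fn = conf.sum(axis=1) - tp   # actually class c but predicted as another
    union = tp + fp + fn
    present = union > 0          # skip classes absent from both maps
    mean_iou = (tp[present] / union[present]).mean()
    global_acc = tp.sum() / conf.sum()
    return mean_iou, global_acc

y_true = [[0, 0, 1], [1, 2, 2]]
y_pred = [[0, 1, 1], [1, 2, 0]]
miou, acc = segmentation_metrics(y_true, y_pred, n_classes=3)
print(round(miou, 3), round(acc, 3))  # 0.5 0.667
```

Global accuracy weights every pixel equally and is therefore dominated by frequent classes, while mean IoU averages over classes, which is why it is more sensitive to performance on under-represented ones.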

Gatech Dataset

For the Gatech dataset, FC-DenseNet103, fine-tuned from CamVid, achieved a global accuracy of 79.4%, outperforming state-of-the-art models that use 3D convolutions to capture temporal information. This suggests that the proposed architecture can be competitive on video data with high temporal redundancy, even though it processes frames independently and has no dedicated temporal components.

Implications and Future Directions

The primary theoretical implication of this work is the demonstration that DenseNets apply to segmentation tasks, combining parameter efficiency, effective use of depth, and feature reuse. Practically, this advance paves the way for more resource-efficient yet highly effective segmentation networks that do not require extensive pretraining or complex post-processing.

Future research directions could explore the incorporation of temporal information directly into the DenseNet framework to enhance video segmentation performance. Moreover, leveraging transfer learning from large-scale datasets such as ImageNet or domain-specific synthetic datasets could further refine the segmentation accuracy and generalization capabilities of FC-DenseNets.

Conclusion

This paper establishes the feasibility and advantages of using DenseNet architectures for semantic segmentation by extending them to fully convolutional networks. The proposed FC-DenseNet models achieve state-of-the-art performance on significant benchmarks with substantially fewer parameters, setting a new standard for efficiency and accuracy in semantic segmentation tasks. The success of these models underscores the potential of dense connectivity patterns in facilitating deep learning advancements for complex pixel-level predictions.
