Dense semantic labeling of sub-decimeter resolution images with convolutional neural networks (1608.00775v2)

Published 2 Aug 2016 in cs.CV

Abstract: Semantic labeling (or pixel-level land-cover classification) in ultra-high resolution imagery (< 10cm) requires statistical models able to learn high level concepts from spatial data, with large appearance variations. Convolutional Neural Networks (CNNs) achieve this goal by learning discriminatively a hierarchy of representations of increasing abstraction. In this paper we present a CNN-based system relying on a downsample-then-upsample architecture. Specifically, it first learns a rough spatial map of high-level representations by means of convolutions and then learns to upsample them back to the original resolution by deconvolutions. By doing so, the CNN learns to densely label every pixel at the original resolution of the image. This results in many advantages, including i) state-of-the-art numerical accuracy, ii) improved geometric accuracy of predictions and iii) high efficiency at inference time. We test the proposed system on the Vaihingen and Potsdam sub-decimeter resolution datasets, involving semantic labeling of aerial images of 9cm and 5cm resolution, respectively. These datasets are composed of many large and fully annotated tiles, allowing an unbiased evaluation of models making use of spatial information. We do so by comparing two standard CNN architectures to the proposed one: standard patch classification, prediction of local label patches by employing only convolutions, and full patch labeling by employing deconvolutions. All the systems compare favorably to or outperform a state-of-the-art baseline relying on superpixels and powerful appearance descriptors. The proposed full patch labeling CNN outperforms these models by a large margin, also showing a very appealing inference time.

Authors

Michele Volpi
Devis Tuia

Summary

Dense Semantic Labeling of Sub-Decimeter Resolution Images with Convolutional Neural Networks

Introduction

The paper discusses applying Convolutional Neural Networks (CNNs) to semantic labeling of ultra-high-resolution images, specifically aerial images with resolutions below 10 cm. This task requires classifying each pixel according to land cover or land use categories. Traditional methods rely on handcrafted features, whereas CNNs autonomously learn multi-level representations, enhancing their ability to manage significant appearance variations.

Methodology

The proposed approach uses a CNN architecture with a downsample-then-upsample strategy that predicts a semantic label for every pixel at the original image resolution:

Downsampling: The network first abstracts high-level features using convolutions. This process creates a reduced spatial map that encodes complex patterns while retaining essential information.
Upsampling: The system then applies deconvolutions (transposed convolutions) to bring the coarse map back to the original image resolution. Because the upsampling filters are learned rather than fixed interpolation kernels, the network produces full-resolution predictions with improved geometric accuracy.
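The two stages above can be illustrated with a minimal sketch of the size arithmetic and a one-dimensional transposed convolution. The patch side (64), the three stride-2 stages, and the filter `[0.5, 1.0, 0.5]` are assumptions chosen for illustration, not values taken from the paper, and the helper names are hypothetical.

```python
def conv_out_len(n, kernel, stride=1, pad=0):
    """Output length of a strided convolution or pooling layer."""
    return (n + 2 * pad - kernel) // stride + 1


def deconv_out_len(n, kernel, stride=1, pad=0):
    """Output length of a transposed convolution (deconvolution)."""
    return (n - 1) * stride - 2 * pad + kernel


def transposed_conv1d(x, w, stride):
    """Scatter each input value, weighted by the filter w, into the output."""
    out = [0.0] * deconv_out_len(len(x), len(w), stride)
    for i, xi in enumerate(x):
        for k, wk in enumerate(w):
            out[i * stride + k] += xi * wk
    return out


# Assumed sizes: a 64-pixel patch side and three 2x2, stride-2 poolings.
side = 64
for _ in range(3):
    side = conv_out_len(side, kernel=2, stride=2)
coarse = side  # side length of the coarse spatial map

# Three stride-2 transposed convolutions undo the spatial reduction.
for _ in range(3):
    side = deconv_out_len(side, kernel=2, stride=2)
restored = side

# With this particular filter the operation reduces to linear interpolation;
# in a trained network the filter weights would be learned instead.
upsampled = transposed_conv1d([1.0, 3.0], [0.5, 1.0, 0.5], stride=2)
print(coarse, restored, upsampled)
```

In this toy case the middle output value lies halfway between the two input values, which is what a fixed linear interpolator would give. A learned filter is free to deviate from that, for example to sharpen object boundaries.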

This design is efficient at inference time: a single forward pass labels every pixel of the input patch, so the network does not need a separate forward pass for each pixel, as standard patch classification does.
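A rough count makes the efficiency argument concrete. The sketch below assumes that patch classification labels only the central pixel of each patch (one forward pass per pixel) and that full patch labeling tiles the image with windows. The 2000 x 2000 tile and the 256-pixel patch are hypothetical sizes, not figures from the paper.

```python
import math


def passes_patch_classification(height, width):
    """One forward pass per pixel: each pass labels one central pixel."""
    return height * width


def passes_full_patch_labeling(height, width, patch, stride=None):
    """One forward pass per window: each pass labels every pixel in it.

    The last window in each direction may overhang the tile border; in
    practice it would be shifted inwards or the tile would be padded.
    """
    stride = stride or patch
    rows = math.ceil(max(height - patch, 0) / stride) + 1
    cols = math.ceil(max(width - patch, 0) / stride) + 1
    return rows * cols


# Hypothetical tile and patch sizes, for illustration only.
per_pixel = passes_patch_classification(2000, 2000)
per_patch = passes_full_patch_labeling(2000, 2000, patch=256)
print(per_pixel, per_patch, per_pixel // per_patch)
```

The count ignores that a single full-patch pass costs more than a single patch-classification pass, so it overstates the real speed-up. It only shows why the number of passes differs by orders of magnitude.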

Evaluation

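The architecture was evaluated on two public benchmark datasets of sub-decimeter aerial imagery: Vaihingen (9 cm resolution) and Potsdam (5 cm resolution). Both consist of large, fully annotated tiles. The proposed full patch labeling CNN was compared with two standard CNN architectures (patch classification and convolution-only prediction of local label patches) and with a baseline that relies on superpixels and appearance descriptors.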

Key results include:

Vaihingen Dataset: The CNN achieved high numerical accuracy, with improved geometric precision. It outperformed baselines in both overall and class-specific accuracy metrics.
Potsdam Dataset: Similarly high accuracy was observed on the finer 5 cm imagery, where the learned deconvolutions handle the increased spatial detail effectively.

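According to the abstract, all three CNN systems compare favorably to or outperform the superpixel baseline, and the proposed full patch labeling CNN outperforms the other models by a large margin while also offering a short inference time.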
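Overall and class-specific scores of the kind mentioned above are usually derived from a confusion matrix. The sketch below assumes overall accuracy and per-class F1 as the metrics, a common choice for semantic labeling benchmarks. This summary does not name the exact metrics, and the 2 x 2 matrix is toy data, not a result from the paper.

```python
def overall_accuracy(cm):
    """Fraction of pixels whose predicted class matches the reference class."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_f1(cm):
    """F1 score per class; rows are reference classes, columns predictions."""
    n = len(cm)
    scores = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # class c pixels missed
        fp = sum(cm[r][c] for r in range(n)) - tp  # other pixels labeled c
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Toy confusion matrix (pixel counts) for two classes, illustration only.
cm = [[8, 2],
      [1, 9]]
oa = overall_accuracy(cm)
f1 = per_class_f1(cm)
print(oa, f1)
```

Reporting per-class scores next to overall accuracy matters for land-cover data, because rare classes contribute little to the overall figure.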

Implications

The architecture's ability to efficiently handle dense semantic labeling at high resolution suggests significant potential for applications in remote sensing and urban planning. The paper highlights how deep learning can automate intricate imagery analysis tasks, enhancing both accuracy and efficiency.

Future Directions

Potential future developments include exploring the network's adaptability to varying resolutions and its application to different types of remote sensing data beyond high-resolution aerial imagery. Incorporating other modalities such as LiDAR data could further improve land cover classifications.

Conclusion

The paper presents a CNN architecture designed for dense semantic labeling of ultra-high-resolution aerial imagery. By first downsampling with convolutions and then upsampling with learned deconvolutions, the network learns hierarchical representations directly from the data and labels every pixel at the original resolution, improving on the superpixel-based baseline in both accuracy and inference time. The work supports greater automation in geospatial analysis.
