- The paper introduces a Transformer-based segmentation scheme that leverages the Swin Transformer for enhanced context extraction in remote sensing images.
- It employs a novel Densely Connected Feature Aggregation Module (DCFAM) with Shared Spatial and Channel Attention to aggregate multi-scale features effectively.
- Experimental results show superior mean F1, Overall Accuracy, and mean IoU compared to traditional ResNet-based models.
A Novel Transformer-Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images
The paper introduces an approach to semantic segmentation of fine-resolution remote sensing images built on a Transformer architecture, with the Swin Transformer as the backbone. The goal is to improve context extraction and feature aggregation relative to traditional Fully Convolutional Networks (FCNs), which typically rely on a ResNet or similar backbone. By adopting the Swin Transformer, the authors exploit its ability to model long-range dependencies more effectively than conventional convolution-based models.
Methodology and Architecture
The paper details a semantic segmentation architecture built on the encoder-decoder paradigm. The encoder uses the Swin Transformer, which captures long-range dependencies through shifted-window multi-head self-attention. The Swin Transformer processes images hierarchically: four stages progressively reduce spatial resolution while increasing channel width, yielding hierarchical features labeled ST1 through ST4 that serve as the intermediate representations consumed by the decoder.
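The paper summary does not include code; the sketch below only illustrates how such a hierarchical backbone can expose its four stage outputs, using the timm library. The model variant, the use of `features_only` for Swin models, and the dummy input size are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: extracting hierarchical stage features from a Swin backbone
# via timm. Assumes a timm version whose Swin models support features_only=True;
# the model variant below is illustrative.
import timm
import torch

encoder = timm.create_model(
    "swin_small_patch4_window7_224",  # a Swin-S style backbone
    pretrained=False,
    features_only=True,               # expose the per-stage feature maps
)

x = torch.randn(1, 3, 224, 224)       # dummy RGB tile
st1, st2, st3, st4 = encoder(x)       # hierarchical features ST1..ST4

# Each stage halves spatial resolution (strides 4, 8, 16, 32 relative to the
# input) while widening channels; the layout (NCHW vs NHWC) depends on the
# timm version, so we only inspect shapes here.
for i, feat in enumerate((st1, st2, st3, st4), start=1):
    print(f"ST{i}:", tuple(feat.shape))
```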
The decoder is built around the Densely Connected Feature Aggregation Module (DCFAM), which refines and combines multi-scale features. DCFAM incorporates Shared Spatial Attention (SSA) and Shared Channel Attention (SCA), which strengthen semantic context along the spatial and channel dimensions, respectively, while dilated convolutions in the Large Field Upsample Connection help capture multi-scale context. Densely connecting the hierarchical transformer features and integrating them across levels marks a clear departure from standard decoder designs.
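Since the authors' DCFAM implementation is not reproduced here, the following PyTorch sketch only approximates the three ideas: an SE-style channel-attention block in the spirit of SCA, a CBAM-style spatial-attention block in the spirit of SSA, and a dilated convolution before upsampling. The module names, reduction ratio, and kernel sizes are assumptions, not the paper's specification.

```python
# Illustrative PyTorch sketch of channel/spatial attention and a dilated
# upsample connection in the spirit of SCA, SSA, and the Large Field Upsample
# Connection; an approximation, not the authors' DCFAM code.
import torch
import torch.nn as nn

class ChannelAttentionSketch(nn.Module):
    """SE-style channel reweighting from a globally pooled descriptor (SCA-like)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class SpatialAttentionSketch(nn.Module):
    """CBAM-style spatial mask from pooled channel statistics (SSA-like)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)       # per-pixel channel mean
        mx, _ = x.max(dim=1, keepdim=True)      # per-pixel channel max
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask

class DilatedUpsampleSketch(nn.Module):
    """Dilated conv enlarges the receptive field before bilinear upsampling."""
    def __init__(self, in_ch: int, out_ch: int, dilation: int = 2, scale: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation)
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)

    def forward(self, x):
        return self.up(self.conv(x))

feat = torch.randn(1, 96, 56, 56)               # e.g., an ST1-sized feature map
refined = SpatialAttentionSketch()(ChannelAttentionSketch(96)(feat))
upsampled = DilatedUpsampleSketch(96, 48)(refined)
print(refined.shape, upsampled.shape)           # (1, 96, 56, 56) (1, 48, 112, 112)
```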
Experimental Evaluation and Results
The authors conducted experiments on two benchmark datasets, ISPRS Vaihingen and Potsdam, to validate their approach. The DC-Swin model was compared against several state-of-the-art methods, including DeepLabV3+, PSPNet, and various ResNet-based architectures, and achieved superior mean F1-score, Overall Accuracy (OA), and mean Intersection over Union (mIoU) on both datasets. On the Potsdam dataset, for example, DC-Swin surpassed the compared models with a mean F1 of 93.25%, an OA of 92.00%, and an mIoU of 87.56%.
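The reported metrics follow their standard per-class definitions, so a short sketch computing them from a class confusion matrix may help make the numbers concrete; the function and the toy matrix below are illustrative, not the authors' evaluation code.

```python
# Overall Accuracy, mean F1, and mean IoU from a confusion matrix, using the
# standard per-class definitions; illustrative, not the paper's eval script.
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp          # predicted as the class but wrong
    fn = conf.sum(axis=1) - tp          # belonging to the class but missed
    oa = tp.sum() / conf.sum()          # Overall Accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)    # per-class F1
    iou = tp / (tp + fp + fn)           # per-class IoU
    return oa, f1.mean(), iou.mean()

# Toy 3-class confusion matrix purely for demonstration.
conf = np.array([[50, 2, 1],
                 [3, 45, 2],
                 [1, 1, 48]])
oa, mf1, miou = segmentation_metrics(conf)
print(f"OA={oa:.4f}  mF1={mf1:.4f}  mIoU={miou:.4f}")
```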
An ablation study was also conducted to discern the impact of each component, revealing that Swin-S as a backbone produced noticeable improvements in segmentation accuracy over ResNet. Furthermore, incorporating DCFAM, along with its SSA and SCA components, significantly enhanced the model's performance, verifying the efficacy of these innovations in handling complex segmentation challenges.
Implications and Future Directions
The introduction of Transformer models, particularly the Swin Transformer, to the segmentation of fine-resolution remote sensing images marks a notable step in carrying advances from natural language processing over to computer vision. The work not only highlights the potential of Transformers for semantic segmentation but also proposes a feature aggregation mechanism that could shape future adaptations of Transformers to other computer vision applications.
Future research could extend the methodology to other forms of remote sensing data, such as multispectral or hyperspectral images, potentially adapting the DCFAM and Swin Transformer combination to handle additional spectral dimensions. Integrating the scheme into real-time applications is another promising direction, especially if optimizations for computational efficiency are developed. Overall, the paper lays the groundwork for employing Transformer architectures in remote sensing image analysis, potentially accommodating a wider array of data sources and segmentation requirements.