
UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery

Published 18 Sep 2021 in cs.CV (arXiv:2109.08937v4)

Abstract: Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid developments in deep learning technologies, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNN adopts hierarchical feature representation, demonstrating strong capabilities for local information extraction. However, the local property of the convolution layer limits the network from capturing the global context. Recently, as a hot topic in the domain of computer vision, Transformer has demonstrated its great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, the UNetFormer selects the lightweight ResNet18 as the encoder and develops an efficient global-local attention mechanism to model both global and local information in the decoder. Extensive experiments reveal that our method not only runs faster but also produces higher accuracy compared with state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieved 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while the inference speed can achieve up to 322.4 FPS with a 512x512 input on a single NVIDIA GTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves the state-of-the-art result (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/WangLibo1995/GeoSeg.

Citations (478)

Summary

  • The paper introduces UNetFormer, a hybrid CNN-Transformer model that balances local feature extraction with global context modeling for efficient urban segmentation.
  • It achieves 67.8% mIoU on UAVid, 52.4% mIoU on LoveDA, and operates up to 322.4 FPS, demonstrating superior performance against existing methods.
  • The innovative global-local attention mechanism and feature refinement head provide a scalable solution for real-time urban monitoring and future hybrid model research.

Overview of UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation

The paper presents UNetFormer, a hybrid neural network architecture designed to improve the efficiency and accuracy of semantic segmentation of remote sensing urban scenes. UNetFormer combines a lightweight CNN-based encoder (specifically ResNet18) with a Transformer-based decoder, featuring a novel global-local attention mechanism. The proposed model addresses limitations of traditional CNNs in capturing global contextual information, providing a robust solution for real-time semantic segmentation tasks in remote sensing applications.
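The UNet-like flow described above can be sketched schematically. This is a minimal NumPy illustration, not the authors' implementation: average pooling stands in for the ResNet18 encoder stages, and simple addition stands in for the Transformer-based fusion performed in the actual decoder.

```python
import numpy as np

def downsample(x):
    # 2x2 average pooling, standing in for a strided ResNet18 stage.
    return (x[::2, ::2] + x[1::2, ::2] + x[::2, 1::2] + x[1::2, 1::2]) / 4.0

def upsample(x):
    # Nearest-neighbour 2x upsampling used when merging decoder stages.
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def unet_like_forward(img):
    # Encoder: collect multi-scale features (UNet-style skip connections).
    skips, x = [], img
    for _ in range(4):
        x = downsample(x)
        skips.append(x)
    # Decoder: walk back up the hierarchy, fusing each skip feature
    # (here by addition; UNetFormer fuses inside global-local blocks).
    x = skips[-1]
    for skip in reversed(skips[:-1]):
        x = upsample(x) + skip
    return upsample(x)  # prediction at the input resolution

seg = unet_like_forward(np.zeros((512, 512)))
print(seg.shape)  # (512, 512)
```

The shape bookkeeping is the point: four encoder stages shrink a 512x512 input to 32x32, and the decoder restores full resolution while re-injecting spatial detail from the skips.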

Methodology and Innovations

The architecture of UNetFormer consists of:

  • CNN-based Encoder: Adopts ResNet18 for efficient local feature extraction, providing a balance between computational cost and feature richness.
  • Transformer-based Decoder: Utilizes a novel global-local Transformer block (GLTB). This block encompasses a dual-branch structure with a global branch to capture global contexts using window-based self-attention, and a local branch to preserve local details through convolutional operations. A cross-shaped window context interaction module is introduced to efficiently capture relations across windows.
  • Feature Refinement Head (FRH): Combines features from earlier and deeper layers, optimizing spatial details and semantic accuracy by closing the semantic gap.
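The dual-branch idea behind the GLTB can be illustrated with a stripped-down NumPy sketch. This is an assumption-laden simplification, not the paper's block: single-head attention without learned projections, a mean filter in place of the convolutional local branch, and no cross-shaped window interaction module.

```python
import numpy as np

def window_self_attention(x, win):
    # Global branch: scaled dot-product self-attention computed
    # independently inside each non-overlapping win x win window.
    H, W, C = x.shape
    out = np.zeros_like(x)
    for i in range(0, H, win):
        for j in range(0, W, win):
            patch = x[i:i + win, j:j + win].reshape(-1, C)  # (win*win, C)
            scores = patch @ patch.T / np.sqrt(C)
            attn = np.exp(scores - scores.max(axis=1, keepdims=True))
            attn /= attn.sum(axis=1, keepdims=True)          # row softmax
            out[i:i + win, j:j + win] = (attn @ patch).reshape(win, win, C)
    return out

def local_branch(x):
    # Local branch stand-in: 3x3 mean filter in place of the conv path.
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="edge")
    return sum(pad[di:di + x.shape[0], dj:dj + x.shape[1]]
               for di in range(3) for dj in range(3)) / 9.0

def global_local_block(x, win=4):
    # Fuse the two branches; the real GLTB uses learned fusion.
    return window_self_attention(x, win) + local_branch(x)

feat = np.random.rand(8, 8, 16).astype(np.float32)
out = global_local_block(feat)
print(out.shape)  # (8, 8, 16)
```

Restricting attention to fixed windows is what keeps the global branch cheap: cost grows with window size, not image size, which is why the paper needs the extra cross-window interaction module to exchange context between windows.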

Numerical Results and Evaluation

UNetFormer demonstrates significant improvements in inference speed and segmentation accuracy over existing state-of-the-art lightweight networks. It achieves:

  • 67.8% mIoU on UAVid and 52.4% mIoU on LoveDA datasets, while operating at a high inference speed of up to 322.4 FPS on an NVIDIA GTX 3090 GPU.
  • Competitive performance with existing Transformer-based and hybrid networks: an F1 score of 91.3% and mIoU of 84.1% on the Vaihingen dataset, using a heavier alternative configuration that pairs the proposed decoder with a Swin Transformer encoder.
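Throughput figures like 322.4 FPS are typically obtained by timing repeated forward passes after a warm-up phase. The following stdlib sketch shows the general recipe with a hypothetical `model_fn` callable; it is not the authors' benchmarking script, and on a real GPU one would also need to synchronize the device before reading the clock.

```python
import time

def measure_fps(model_fn, input_batch, warmup=10, iters=100):
    # Warm-up runs let caches (and, on GPUs, kernels) settle before timing.
    for _ in range(warmup):
        model_fn(input_batch)
    start = time.perf_counter()
    for _ in range(iters):
        model_fn(input_batch)
    elapsed = time.perf_counter() - start
    return iters / elapsed  # forward passes (frames) per second

# Stand-in "model": any callable taking a 512x512-like input.
dummy_input = [[0.0] * 512 for _ in range(512)]
fps = measure_fps(lambda x: sum(map(sum, x)), dummy_input, warmup=2, iters=20)
print(f"{fps:.1f} FPS")
```

`time.perf_counter` is used rather than `time.time` because it is a monotonic, high-resolution clock intended for interval measurement.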

Implications and Future Directions

The proposed UNetFormer architecture has significant implications for real-time urban monitoring and environmental applications, where both accuracy and processing speed are critical. By balancing the strengths of CNNs in local feature extraction and Transformers in global feature modeling, UNetFormer sets a precedent for future research in designing efficient hybrid architectures, particularly for geospatial tasks.

Further exploration could focus on advanced hybrid designs, optimizing Transformer components for even more efficient spatial detail capture without sacrificing real-time processing, and extending these principles to other domains where fine-resolution imagery is crucial.

Overall, UNetFormer exemplifies an effective integration of CNN and Transformer paradigms, challenging previous bottlenecks in remote sensing image segmentation and opening new avenues for practical applications in urban scene understanding.
