Remote Sensing Image Change Detection with Transformers (2103.00208v3)

Published 27 Feb 2021 in cs.CV

Abstract: Modern change detection (CD) has achieved remarkable success owing to the powerful discriminative ability of deep convolutions. However, high-resolution remote sensing CD remains challenging due to the complexity of objects in the scene. Objects with the same semantic concept may show distinct spectral characteristics at different times and spatial locations. Most recent CD pipelines using pure convolutions still struggle to relate long-range concepts in space-time. Non-local self-attention approaches show promising performance by modeling dense relations among pixels, yet are computationally inefficient. Here, we propose a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain. Our intuition is that the high-level concepts of the change of interest can be represented by a few visual words, i.e., semantic tokens. To achieve this, we express the bitemporal images as a few tokens and use a transformer encoder to model contexts in the compact token-based space-time. The learned context-rich tokens are then fed back to the pixel space to refine the original features via a transformer decoder. We incorporate BIT into a deep feature differencing-based CD framework. Extensive experiments on three CD datasets demonstrate the effectiveness and efficiency of the proposed method. Notably, our BIT-based model significantly outperforms the purely convolutional baseline while requiring about three times lower computational cost and fewer model parameters. Based on a naive backbone (ResNet18) without sophisticated structures (e.g., FPN, UNet), our model surpasses several state-of-the-art CD methods, including four recent attention-based methods, in both efficiency and accuracy. Our code is available at https://github.com/justchenhao/BIT_CD.

Citations (761)

Summary

  • The paper presents the Bitemporal Image Transformer (BIT) which efficiently models spatial-temporal context for improved change detection in remote sensing images.
  • It leverages semantic tokenization to reduce redundancy by capturing high-level abstract representations through a compact set of tokens.
  • Experimental results demonstrate that BIT outperforms conventional CNN-based methods by achieving higher accuracy with lower computational costs.

Remote Sensing Image Change Detection with Transformers

The paper "Remote Sensing Image Change Detection with Transformers" introduces a novel approach to change detection (CD) in high-resolution remote sensing images by leveraging transformer architectures. The complexity of objects within a given scene and the variations in imaging conditions have traditionally posed challenges for CD tasks, even with the powerful feature extraction capabilities of convolutional neural networks (CNNs). This paper proposes the Bitemporal Image Transformer (BIT) to address these challenges by effectively modeling context within the spatial-temporal domain.

Key Contributions

  1. Bitemporal Image Transformer (BIT): The primary innovation is BIT, which efficiently models the spatial-temporal context in remote sensing images. Unlike traditional methods that rely heavily on convolutions with limited receptive fields, BIT introduces a token-based approach, representing high-level change concepts through a small number of semantic tokens.
  2. Semantic Tokenization: By expressing bitemporal images as token sets, the model reduces redundancy and focuses on high-level abstract representations. Each token captures significant contextual relations, allowing the transformer to operate in a compact yet information-dense space.
  3. Transformer Architecture: The transformer encoder within BIT models dependencies across the token sets in space-time, learning rich semantic relations. A lightweight transformer decoder then projects these context-rich tokens back into pixel space to refine the original feature maps (a minimal sketch of this pipeline appears after this list).
  4. Efficiency and Performance: The BIT-based model achieves higher accuracy than a purely convolutional baseline while using significantly lower computational cost and fewer model parameters. Results on multiple datasets show it surpassing recent state-of-the-art attention-based CD methods in both efficiency and accuracy.
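
To make these pieces concrete, the following is a minimal PyTorch sketch of how a semantic tokenizer, a transformer encoder over the compact token space, and a transformer decoder could be wired together. It is not the authors' released implementation (see the linked repository); module names and hyper-parameters such as num_tokens, dim, depth, and heads are illustrative assumptions.

```python
# Hypothetical sketch of the BIT module: semantic tokenizer, transformer
# encoder over the compact token space, and a transformer decoder that
# projects context-rich tokens back onto pixel-level features.
import torch
import torch.nn as nn


class SemanticTokenizer(nn.Module):
    """Pools an HxW feature map into a few semantic tokens via spatial
    attention (one attention map per token)."""

    def __init__(self, dim: int, num_tokens: int = 4):
        super().__init__()
        self.attn = nn.Conv2d(dim, num_tokens, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        attn = self.attn(feat).flatten(2).softmax(dim=-1)   # (B, L, HW)
        feat = feat.flatten(2)                               # (B, C, HW)
        return torch.einsum("blh,bch->blc", attn, feat)      # (B, L, C)


class BIT(nn.Module):
    """Encodes bitemporal tokens jointly, then refines each temporal
    feature map with a transformer decoder (tokens as keys/values)."""

    def __init__(self, dim: int = 32, num_tokens: int = 4,
                 depth: int = 1, heads: int = 8):
        super().__init__()
        self.tokenizer = SemanticTokenizer(dim, num_tokens)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 2,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, dim * 2,
                                               batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, depth)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        b, c, h, w = x1.shape
        t1, t2 = self.tokenizer(x1), self.tokenizer(x2)
        # Model context in the compact bitemporal token space.
        tokens = self.encoder(torch.cat([t1, t2], dim=1))     # (B, 2L, C)
        t1, t2 = tokens.chunk(2, dim=1)
        # Queries are pixel features; keys/values are context-rich tokens.
        q1 = x1.flatten(2).transpose(1, 2)                    # (B, HW, C)
        q2 = x2.flatten(2).transpose(1, 2)
        x1 = self.decoder(q1, t1).transpose(1, 2).reshape(b, c, h, w)
        x2 = self.decoder(q2, t2).transpose(1, 2).reshape(b, c, h, w)
        return x1, x2
```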

Strong Numerical Results

The proposed BIT-based model consistently outperforms strong baselines on the benchmark datasets:

  • The model improves F1-scores over recent methods such as STANet and IFNet across multiple datasets.
  • It achieves this with a CNN backbone as simple as ResNet18, forgoing more sophisticated network designs such as FPN or UNet, which underscores the efficacy of the transformer component (an illustrative end-to-end sketch follows this list).
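
As an illustration of that design choice, the sketch below wires a shared ResNet18 backbone, the BIT module from the earlier sketch, absolute feature differencing, and a small prediction head into an end-to-end change detector. Layer choices, channel widths, and the torchvision API usage (a recent version is assumed) are hypothetical and not the authors' exact configuration.

```python
# Illustrative differencing-based CD framework around the BIT sketch above:
# a shared (siamese) ResNet18 backbone, token-based context refinement,
# absolute feature differencing, and a small prediction head.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class BITChangeDetector(nn.Module):
    def __init__(self, dim: int = 32, num_classes: int = 2):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep layers up to layer3 (256 channels), then project down.
        self.backbone = nn.Sequential(*list(backbone.children())[:-3])
        self.reduce = nn.Conv2d(256, dim, kernel_size=1)
        self.bit = BIT(dim=dim)                      # module from the sketch above
        self.head = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, num_classes, 1),
        )

    def forward(self, img1: torch.Tensor, img2: torch.Tensor) -> torch.Tensor:
        f1 = self.reduce(self.backbone(img1))        # shared weights (siamese)
        f2 = self.reduce(self.backbone(img2))
        f1, f2 = self.bit(f1, f2)                    # context-refined features
        diff = torch.abs(f1 - f2)                    # feature differencing
        logits = self.head(diff)
        # Upsample the change map back to the input resolution.
        return nn.functional.interpolate(
            logits, size=img1.shape[-2:], mode="bilinear", align_corners=False)


# Usage: bitemporal 256x256 RGB patches -> per-pixel change logits.
model = BITChangeDetector()
a, b = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
print(model(a, b).shape)                             # torch.Size([1, 2, 256, 256])
```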

Practical and Theoretical Implications

Practical Implications:

  • Introducing transformer architectures to remote sensing broadens the field's utility for accurately detecting and analyzing changes in land cover or land use.
  • This approach promises efficiency improvements in processing high-resolution images, thus potentially reducing operational costs and enhancing decision-making in urban planning, deforestation monitoring, and disaster management.

Theoretical Implications:

  • BIT's token-based context modeling offers a new perspective on feature extraction, suggesting that semantic abstraction can be beneficial in other domains facing similar challenges of visual complexity and redundancy.
  • The research demonstrates the applicability of transformers beyond traditional language processing tasks, providing a foundation for further exploration in other computer vision applications.

Future Developments

Looking forward, the methodology outlined offers several avenues for innovation. Integrating BIT with larger transformer models and combining it with other machine learning techniques (such as graph-based methods) could yield even more robust models. Furthermore, adapting it to a wider range of remote sensing data types, including multispectral or hyperspectral imagery, could extend its applicability. Finally, exploring unsupervised or semi-supervised variants could reduce the dependency on annotated data, making the technology more broadly accessible.

In conclusion, by addressing the limitations of conventional methods through a transformer-based approach, this paper marks a meaningful development in the field of remote sensing image analysis, offering a scalable and efficient solution to high-resolution image change detection.