
USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery (2312.02199v1)

Published 2 Dec 2023 in cs.CV, cs.AI, cs.LG, eess.IV, and stat.AP

Abstract: Large, self-supervised vision models have led to substantial advancements for automatically interpreting natural images. Recent works have begun tailoring these methods to remote sensing data which has rich structure with multi-sensor, multi-spectral, and temporal information providing massive amounts of self-labeled data that can be used for self-supervised pre-training. In this work, we develop a new encoder architecture called USat that can input multi-spectral data from multiple sensors for self-supervised pre-training. USat is a vision transformer with modified patch projection layers and positional encodings to model spectral bands with varying spatial scales from multiple sensors. We integrate USat into a Masked Autoencoder (MAE) self-supervised pre-training procedure and find that a pre-trained USat outperforms state-of-the-art self-supervised MAE models trained on remote sensing data on multiple remote sensing benchmark datasets (up to 8%) and leads to improvements in low data regimes (up to 7%). Code and pre-trained weights are available at https://github.com/stanfordmlgroup/USat .


Summary

  • The paper introduces USat, a self-supervised vision transformer designed for multi-sensor satellite imagery that eliminates the need for extensive labeled data.
  • It introduces modified patch projection layers and positional encodings to integrate spectral bands with varying spatial resolutions, achieving up to 8% improvement on benchmarks.
  • USat's flexible architecture supports arbitrary spectral band combinations, reducing computational loads and enhancing performance in low-data regimes.

In the field of satellite imaging, leveraging the vast amounts of data collected from Earth observation satellites is paramount for a multitude of applications ranging from agriculture and energy to disaster response and climate monitoring. A recent development in this field is the creation of a new encoder architecture known as USat, designed for multi-sensor satellite imagery. This architecture is particularly innovative because it is trained in a self-supervised manner, meaning it learns to interpret the data without the need for manually labeled datasets, which are often expensive and time-consuming to produce.

USat, developed by researchers at Stanford University, is a vision transformer adapted to accommodate multi-spectral data from multiple sensors. It achieves this by introducing modified patch projection layers and positional encodings, which allow it to process spectral bands with various spatial scales.
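The core idea behind the modified patch projection can be sketched with a small calculation. This is an illustrative sketch, not the authors' implementation: the band names, image sizes, and ground sampling distances (GSDs) below are assumed example values, and `patch_grid` is a hypothetical helper. The point it demonstrates is that if each band's patch size (in pixels) is chosen so that every patch covers the same ground extent, bands from different sensors tile into the same spatial grid and can share positional encodings.

```python
# Illustrative sketch: pick a per-band patch size so that patches from
# sensors with different ground sampling distances (GSD, meters/pixel)
# all cover the same ground area. Band names and GSDs are example
# assumptions, not the paper's exact configuration.

GROUND_PATCH_METERS = 80  # target ground extent covered by one patch

bands = {
    # band group: (image size in pixels, GSD in meters/pixel)
    "S2-RGB":  (96, 10),   # Sentinel-2 10 m bands
    "S2-SWIR": (48, 20),   # Sentinel-2 20 m bands
    "NAIP":    (960, 1),   # high-resolution aerial imagery
}

def patch_grid(image_px: int, gsd_m: float, ground_patch_m: float):
    """Return (patch_size_px, patches_per_side) for one band."""
    patch_px = int(ground_patch_m / gsd_m)         # pixels per patch
    assert image_px % patch_px == 0, "image must tile evenly into patches"
    return patch_px, image_px // patch_px

for name, (px, gsd) in bands.items():
    p, n = patch_grid(px, gsd, GROUND_PATCH_METERS)
    print(f"{name}: {p}x{p} px patches, {n}x{n} grid")
```

With these example numbers every band produces the same 12x12 grid of ground-aligned patches, which is what lets a single spatial positional encoding be shared across sensors with very different native resolutions.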

In more concrete terms, USat is integrated into a Masked Autoencoder (MAE) self-supervised pre-training procedure. This method trains the encoder to predict the parts of the input image that are masked (hidden) from the visible parts, thus learning a robust representation of the data. The primary benefit of USat over previous models is its ability to handle arbitrary collections of images with different sets of spectral bands and ground sampling distances, reducing computational load and improving downstream performance.
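The masking step of this procedure can be sketched as follows. This is a minimal illustration of the standard MAE recipe, not the paper's exact sampling scheme: the function name is hypothetical, and the 75% mask ratio is the common MAE default rather than a value confirmed by the source.

```python
import random

def random_mask(num_patches: int, mask_ratio: float = 0.75, seed: int = 0):
    """Split patch indices into visible and masked sets, MAE-style.

    The encoder only processes the visible subset; a lightweight decoder
    is then trained to reconstruct the pixels of the masked patches.
    Illustrative sketch, not the paper's exact implementation.
    """
    rng = random.Random(seed)
    idx = list(range(num_patches))
    rng.shuffle(idx)                       # random permutation of patches
    n_masked = int(num_patches * mask_ratio)
    masked, visible = idx[:n_masked], idx[n_masked:]
    return sorted(visible), sorted(masked)

# e.g. a 12x12 patch grid with a 75% mask ratio
visible, masked = random_mask(num_patches=144, mask_ratio=0.75)
print(len(visible), len(masked))  # 36 108
```

Because the encoder sees only the unmasked quarter of the patches, pre-training is both cheaper per image and forces the model to infer missing spectral and spatial content from context.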

Experiments show that USat outperforms single-sensor pre-training approaches in both image interpretation accuracy and efficiency in scenarios where labeled data is scarce. They demonstrate that leveraging data from multiple sensors significantly boosts the model's learning capacity compared to using a single type of sensor data.

An especially compelling aspect of USat is the performance of its self-supervised MAE model, which has been evaluated on various benchmark datasets. Improvements of up to 8% on multiple remote sensing benchmarks and up to 7% in low-data regimes were observed, an encouraging leap forward for the field.

Moreover, the USat architecture can support a flexible selection of spectral bands for fine-tuning the model, thereby increasing adaptability for varied practical applications. The researchers have also made sure that key resources, such as the code and pre-trained weights, are readily available to the public, encouraging further development and application of their work.

In summary, USat represents a step forward in the efficient and effective interpretation of multi-sensor satellite imagery. With its self-supervised learning methodology, it holds the promise of advancing geographical analysis and various crucial applications of satellite data, marking a significant stride toward autonomous and agile satellite image processing.
