SatSwinMAE: Efficient Autoencoding for Multiscale Time-series Satellite Imagery (2405.02512v2)
Abstract: Recent advancements in foundation models have significantly impacted various fields, including natural language processing, computer vision, and multi-modal tasks. One area that stands to benefit greatly is Earth observation, where these models can efficiently process large-scale, unlabeled geospatial data. In this work, we extend the SwinMAE model to integrate temporal information for satellite time-series data. The architecture employs a hierarchical 3D Masked Autoencoder (MAE) with Video Swin Transformer blocks to effectively capture multi-scale spatio-temporal dependencies in satellite imagery. To enhance transfer learning, we incorporate both encoder and decoder pretrained weights, along with skip connections to preserve scale-specific information, yielding an architecture similar to SwinUNet with an additional temporal component. Our approach shows significant performance improvements over existing state-of-the-art foundation models on all evaluated downstream tasks: land cover segmentation, building density prediction, flood mapping, wildfire scar mapping, and multi-temporal crop segmentation. In particular, on the land cover segmentation task of the PhilEO Bench dataset, it outperforms other geospatial foundation models by 10.4% in accuracy.
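The core pretraining idea the abstract describes, tokenizing a satellite time-series into 3D spatio-temporal patches and randomly masking most of them before reconstruction, can be sketched as below. This is an illustrative NumPy sketch of generic 3D patchification and MAE-style random masking, not the authors' implementation; the patch sizes (`pt`, `ph`, `pw`) and the 75% mask ratio are assumed values chosen for the example.

```python
import numpy as np

def patchify_3d(x, pt=2, ph=4, pw=4):
    """Split a (T, H, W, C) time-series into flattened 3D patches.

    Returns an array of shape (num_patches, pt*ph*pw*C), where each row
    is one spatio-temporal patch (a "token" for the encoder).
    """
    T, H, W, C = x.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = x.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # group the three patch axes together
    return x.reshape(-1, pt * ph * pw * C)

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """MAE-style random masking: keep a (1 - mask_ratio) subset of tokens.

    Returns the visible tokens, their indices, and a boolean mask where
    True marks a masked (hidden) token the decoder must reconstruct.
    """
    rng = rng or np.random.default_rng(0)
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False  # False = visible, True = masked
    return tokens[keep_idx], keep_idx, mask

# Example: 8 timesteps of 32x32 imagery with 4 spectral bands
video = np.random.default_rng(1).normal(size=(8, 32, 32, 4))
tokens = patchify_3d(video)                 # (256, 128) tokens
visible, keep_idx, mask = random_masking(tokens, mask_ratio=0.75)
print(tokens.shape, visible.shape)          # (256, 128) (64, 128)
```

In an actual MAE, only the visible tokens would pass through the encoder, while the decoder receives learned mask tokens at the masked positions and is trained to reconstruct the original patch pixels.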