- The paper presents a novel self-supervised approach that leverages correlated audio and imagery to pre-train deep networks for remote sensing without labeled data.
- It introduces the SoundingEarth dataset of co-located aerial images and audio samples and uses a batch triplet loss to align the multimodal embeddings effectively.
- Experiments on aerial image classification and segmentation tasks demonstrate superior performance over traditional ImageNet pre-training methods.
Self-supervised Audiovisual Representation Learning for Remote Sensing Data
The paper "Self-supervised Audiovisual Representation Learning for Remote Sensing Data" proposes a novel approach to pre-train deep neural networks for remote sensing applications. This research leverages the correlation between geo-tagged audio recordings and remote sensing imagery to enhance representation learning without the need for labeled datasets. The unique contribution lies in introducing a self-supervised learning methodology that harnesses audiovisual data for pre-training, thus addressing the challenge posed by the lack of large annotated datasets in remote sensing.
Methodology and Dataset
The researchers designed the SoundingEarth dataset, comprising co-located aerial imagery and audio samples from diverse geographical locations. Using this dataset, they pre-train ResNet encoders that project samples from both modalities into a shared embedding space, encouraging the networks to learn features that shape both the visual and the auditory character of a scene. The framework is trained with a bespoke batch triplet loss that builds on the traditional triplet loss but operates batch-wise, so that every non-matching pairing within a batch contributes to the objective (a minimal sketch is given below).
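The paper's exact formulation of the batch triplet loss is not reproduced here; the following is a minimal sketch, assuming a margin-based setup in which each co-located image-audio pair acts as anchor and positive while all other audio (or image) embeddings in the batch serve as negatives. Function and variable names are illustrative, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def batch_triplet_loss(img_emb, audio_emb, margin=0.2):
    """Batch-wise triplet loss over co-located image/audio embeddings.

    img_emb, audio_emb: (B, D) tensors; row i of each comes from the same
    geographic location, so the diagonal of the similarity matrix holds the
    positive pairs and every off-diagonal entry is a negative.
    """
    img_emb = F.normalize(img_emb, dim=1)
    audio_emb = F.normalize(audio_emb, dim=1)

    sim = img_emb @ audio_emb.t()            # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)            # similarity of matching pairs

    # Hinge loss on every non-matching pair, evaluated in both directions.
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    loss_i2a = F.relu(margin - pos + sim)[off_diag].mean()      # image -> audio
    loss_a2i = F.relu(margin - pos + sim.t())[off_diag].mean()  # audio -> image
    return 0.5 * (loss_i2a + loss_a2i)
```

In a training step, img_emb and audio_emb would be the outputs of the two ResNet branches for a batch of co-located image/audio pairs, so every sample in the batch is exploited as both a positive and a negative.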
Evaluation and Results
The paper provides an empirical evaluation of the proposed approach by transferring the learned representations to several downstream tasks, including aerial image classification and segmentation. Experiments on widely used benchmarks (UC Merced Land Use, NWPU-RESISC45, and AID) show that the proposed pre-training surpasses existing methods, including conventional ImageNet pre-training and other self-supervised techniques, reaching higher classification accuracy after only a few epochs of fine-tuning.
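Transferring the representations to such a downstream classifier typically amounts to loading the pre-trained image encoder and swapping its head for a task-specific one. The sketch below assumes a ResNet-18 backbone in PyTorch; the checkpoint filename and class count are hypothetical, and the actual released weights may use a different architecture or key layout.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Load the self-supervised weights into a plain ResNet-18 backbone.
# "soundingearth_resnet18.pth" is a hypothetical filename for the checkpoint.
encoder = resnet18(weights=None)
state = torch.load("soundingearth_resnet18.pth", map_location="cpu")
encoder.load_state_dict(state, strict=False)  # skip projection-head keys, if any

# Replace the final layer for the downstream task,
# e.g. 45 scene classes for NWPU-RESISC45.
encoder.fc = nn.Linear(encoder.fc.in_features, 45)

# The model can now be fine-tuned end-to-end, or with the backbone frozen
# for a linear-probe style evaluation.
```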
Moreover, the efficacy of the learned features in segmentation is corroborated on the DeepGlobe Land Cover Classification dataset, where the pre-trained weights yield significantly better segmentation quality than random or ImageNet initialization. Additionally, the audiovisual scene classification task on the ADVANCE dataset further demonstrates the utility of the pre-trained models when dealing with multimodal data.
Implications and Future Directions
The introduction of the SoundingEarth dataset and the associated pre-training framework has several implications for both theoretical research and practical applications in the field of remote sensing. The self-supervised approach mitigates the dependency on labeled datasets, thus facilitating broader applicability across various sensing modalities that do not fit neatly into the RGB framework provided by datasets like ImageNet. The integration of audio data as a supplementary modality could lead to new insights in environmental monitoring and geospatial analysis.
Moving forward, the framework could be extended to handle additional modalities such as synthetic-aperture radar or hyperspectral imagery, which are prevalent in remote sensing but currently underexplored in multimodal self-supervised learning. Furthermore, the presented research opens avenues for more sophisticated multimodal learning architectures that can better exploit the spatial and temporal correlations of remote sensing data streams. The availability of the SoundingEarth dataset and pre-trained models also sets a foundation for exploring other geospatial tasks, including multimodal retrieval and change detection, using self-supervised representation learning.