- The paper demonstrates that incorporating multiple modalities in deep networks improves classification accuracy for complex remote sensing scenes.
- The authors introduce the MDL-RS framework featuring five fusion strategies, including an innovative cross fusion method for effective feature blending.
- Experimental results, such as a 91.99% overall accuracy on the HS-LiDAR Houston2013 dataset, underline the method's robustness and transferability.
More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing Imagery Classification
"More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing Imagery Classification" by Hong et al. explores the hypothesis that incorporating multiple modalities enhances the precision and robustness of classification models in remote sensing (RS) imagery. The work proposes a generic multimodal deep learning (MDL) framework that aims to mitigate the traditional limitations of single-modality deep learning classifiers when applied to complex RS scenes.
Overview
The research identifies three pivotal questions that determine the efficacy of multimodal frameworks: "what to fuse", "where to fuse", and "how to fuse". The authors introduce a comprehensive MDL framework, dubbed MDL-RS, that integrates five fusion strategies within deep networks, addressing both pixel-level and spatial-spectral joint classification tasks. The experiments cover two primary network backbones: fully connected networks (FC-Nets) and convolutional neural networks (CNNs).
Fusion Strategies
The five fusion methodologies explored are:
- Early Fusion: Concatenating the modalities before the feature extraction stage, so a single network processes the joint input.
- Middle Fusion: Fusing the modality-specific streams midway through the feature extraction networks.
- Late Fusion: Processing each modality independently and fusing the resulting features just before the classification stage.
- Encoder-Decoder Fusion: Using an encoder to compress the multimodal information into a shared representation and a decoder to reconstruct modality-specific representations from it.
- Cross Fusion: A novel strategy in which the modality-specific streams interactively exchange features and update shared network weights across layers, leading to more effective information blending.
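The structural difference between these fusion points can be sketched with toy fully connected layers. The feature dimensions and random weights below are illustrative stand-ins, not the paper's actual architecture; middle fusion and encoder-decoder fusion follow the same pattern with the merge point and reconstruction step moved accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w):
    """One toy fully connected layer with ReLU activation."""
    return np.maximum(x @ w, 0.0)

# Hypothetical inputs: a 64-band hyperspectral pixel and a 4-band LiDAR-derived pixel.
hs = rng.normal(size=(1, 64))
li = rng.normal(size=(1, 4))

# Early fusion: concatenate raw inputs, then extract features jointly.
w_early = rng.normal(size=(68, 32))
early_feat = layer(np.concatenate([hs, li], axis=1), w_early)

# Late fusion: extract features per modality, concatenate before the classifier.
w_hs, w_li = rng.normal(size=(64, 16)), rng.normal(size=(4, 16))
late_feat = np.concatenate([layer(hs, w_hs), layer(li, w_li)], axis=1)

# Cross fusion (schematic): each stream's next layer consumes the combined
# features of BOTH streams, so information is exchanged across modalities
# at every fusion layer rather than at a single merge point.
w2_hs, w2_li = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
f_hs, f_li = layer(hs, w_hs), layer(li, w_li)   # 16-d features per stream
mixed = np.concatenate([f_hs, f_li], axis=1)    # 32-d shared representation
cross_hs, cross_li = layer(mixed, w2_hs), layer(mixed, w2_li)

print(early_feat.shape, late_feat.shape, cross_hs.shape)  # (1, 32) (1, 32) (1, 16)
```

The key design difference is where gradients can flow between modalities: early and cross fusion let each modality shape the other's features during training, while late fusion keeps the streams independent until the classifier.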
Numerical Results
The quantitative results are compelling:
- On the hyperspectral-LiDAR (HS-LiDAR) Houston2013 data, the best-performing strategy, cross fusion with CNNs, achieved an overall accuracy (OA) of 91.99%, a notable improvement over single-modality inputs.
- On the multispectral-SAR (MS-SAR) local climate zone (LCZ) data, cross fusion outperformed all other strategies, showing higher OA and greater robustness, particularly in cross-modality learning (CML) scenarios.
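Overall accuracy, the metric reported above, is simply the fraction of correctly classified samples. A minimal sketch (the toy labels below are illustrative, not the paper's data):

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Overall accuracy (OA): fraction of samples whose predicted label matches the ground truth."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

# Ten labeled pixels across four hypothetical classes.
y_true = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 0, 3, 3, 3]
print(overall_accuracy(y_true, y_pred))  # 0.8
```

Note that OA can mask poor performance on rare classes, which is why classification papers often report average accuracy and the kappa coefficient alongside it.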
Practical and Theoretical Implications
The research has several implications:
- Enhancing Classification Performance: The unified MDL framework shows superior performance across diverse and complex RS scenes.
- Transferability: Cross fusion, in particular, exhibited higher adaptability in cross-modality contexts, where a modality is missing during inference.
- Data Diversity: Leveraging heterogeneous data sources can lead to substantial improvements in feature representation and classification tasks.
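The cross-modality setting, where one modality is unavailable at inference, can be illustrated with a toy fused network. Zero-filling the missing modality's slot, as below, is one simple stand-in for this scenario; the paper's cross-fusion networks handle it more gracefully by letting the available modality's features drive both streams. All layer sizes and weights here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def layer(x, w):
    """One toy fully connected layer with ReLU activation."""
    return np.maximum(x @ w, 0.0)

# Per-modality encoders trained jointly (random weights as stand-ins),
# followed by a layer operating on the fused representation.
w_hs, w_li = rng.normal(size=(64, 16)), rng.normal(size=(4, 16))
w_shared = rng.normal(size=(32, 8))

def predict_features(hs=None, li=None):
    """Build the fused representation; a missing modality's slot is zero-filled."""
    f_hs = layer(hs, w_hs) if hs is not None else np.zeros((1, 16))
    f_li = layer(li, w_li) if li is not None else np.zeros((1, 16))
    return layer(np.concatenate([f_hs, f_li], axis=1), w_shared)

# LiDAR missing at test time: the network still produces a usable representation.
hs_only = predict_features(hs=rng.normal(size=(1, 64)))
print(hs_only.shape)  # (1, 8)
```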
Future Developments
The paper opens the door for further exploration:
- Weakly-Supervised Learning: Introducing weak supervision or self-supervised methods to enhance MDL's performance with limited annotated datasets.
- Scalability: Extending MDL frameworks to handle larger and more complex datasets typical in real-world applications.
- Fusion Techniques: Improving and evolving fusion strategies further to maximize the intrinsic value of each modality.
In conclusion, the study presents a robust and insightful advancement in multimodal deep learning for RS imagery classification. The proposed MDL-RS framework, with its fusion strategies and in particular the cross-fusion method, demonstrates substantial improvements in handling complex scenes and enhancing classification accuracy. The research sets a foundational benchmark for future work in multimodal data fusion, with potential applications across Earth observation and beyond.