Onboard Satellite Image Classification for Earth Observation: A Comparative Study of ViT Models (2409.03901v3)

Published 5 Sep 2024 in cs.CV and eess.SP

Abstract: This study focuses on identifying the most effective pre-trained model for land use classification in onboard satellite processing, emphasizing achieving high accuracy, computational efficiency, and robustness against noisy data conditions commonly encountered during satellite-based inference. Through extensive experimentation, we compare the performance of traditional CNN-based, ResNet-based, and various pre-trained vision Transformer models. Our findings demonstrate that pre-trained Vision Transformer (ViT) models, particularly MobileViTV2 and EfficientViT-M2, outperform models trained from scratch in terms of accuracy and efficiency. These models achieve high performance with reduced computational requirements and exhibit greater resilience during inference under noisy conditions. While MobileViTV2 has excelled on clean validation data, EfficientViT-M2 has proved more robust when handling noise, making it the most suitable model for onboard satellite EO tasks. Our experimental results demonstrate that EfficientViT-M2 is the optimal choice for reliable and efficient RS-IC in satellite operations, achieving 98.76 % of accuracy, precision, and recall. Precisely, EfficientViT-M2 delivers the highest performance across all metrics, excels in training efficiency (1,000s) and inference time (10s), and demonstrates greater robustness (overall robustness score of 0.79). Consequently, EfficientViT-M2 consumes 63.93 % less power than MobileViTV2 (79.23 W) and 73.26 % less power than SwinTransformer (108.90 W). This highlights its significant advantage in energy efficiency.

Summary

The paper demonstrates that ViT models, notably EfficientViT-M2, balance accuracy and efficiency for onboard satellite image classification.
The study reveals that pre-trained ViT architectures outperform CNNs while maintaining robust performance under Gaussian noise and motion blur.
The findings indicate that EfficientViT-M2 is optimal for Earth observation tasks, offering strong performance with reduced computational demands.

Introduction

The paper addresses remote sensing image classification (RS-IC), a core task in Earth observation (EO) systems. It argues that Vision Transformers (ViTs) now surpass traditional CNNs on this task, and it compares several ViT models on accuracy, computational efficiency, and robustness to noise, focusing on land use classification for onboard satellite processing.

Machine Learning Models Compared

The paper evaluates a range of models, from conventional CNNs and ResNets to recent ViT architectures. The pre-trained ViTs stand out, outperforming the models trained from scratch.

Pretrained ViT Models: EfficientViT-M2, MobileViTV2, and SwinTransformer were included in the analysis. MobileViTV2 achieved the highest accuracy on clean validation data, but at a higher computational cost than EfficientViT-M2.
Models Trained from Scratch: CNNs, ResNet-14, Compact Convolutional Transformer (CCT), and SmallViT were tested, with ResNet the notable performer in this group.

The models were assessed on accuracy, precision, and recall, and also on computational efficiency: training time, inference time, and power consumption.

Figure 1: Statistical comparison for model performance.
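As a reminder of how the three classification metrics relate, the following is a minimal sketch in plain Python. It assumes macro averaging over classes, which the summary does not confirm as the paper's choice, and the land use labels and the `macro_metrics` helper are hypothetical names used only for illustration.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for two label lists.

    Macro averaging is an assumption made for this sketch; the paper may
    have used a different averaging scheme.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    accuracy = sum(tp.values()) / len(y_true)
    # Per-class precision and recall, defined as 0.0 when the denominator is 0.
    precisions = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in labels]
    recalls = [tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0 for c in labels]
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)


# Hypothetical labels, not taken from the paper's dataset.
y_true = ["forest", "forest", "river", "river", "crop", "crop"]
y_pred = ["forest", "river", "river", "river", "crop", "forest"]
accuracy, precision, recall = macro_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With macro averaging each class counts equally, so the three numbers coincide only when errors are spread evenly across classes.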

Computational Efficiency

EfficientViT-M2 emerged as the most efficient model. Its lean architecture reduces computational load relative to the other ViT models: according to the abstract, it consumes 63.93 % less power than MobileViTV2 (79.23 W) and 73.26 % less than SwinTransformer (108.90 W). This balance of accuracy and resource usage makes it well suited to onboard satellite systems.

Figure 2: Statistical comparison for power consumption during inference.
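The abstract quotes power savings as percentages relative to two baselines (63.93 % less than MobileViTV2 at 79.23 W, 73.26 % less than SwinTransformer at 108.90 W). The sketch below shows the arithmetic behind such a figure and back-computes the EfficientViT-M2 power draw each percentage implies. The function names are illustrative, and the implied wattages are derived estimates, not values reported in this summary.

```python
def percent_reduction(baseline_watts: float, model_watts: float) -> float:
    """Percentage by which model_watts is lower than baseline_watts."""
    return 100.0 * (baseline_watts - model_watts) / baseline_watts


def implied_power(baseline_watts: float, percent_less: float) -> float:
    """Power draw implied by 'percent_less % less power than the baseline'."""
    return baseline_watts * (1.0 - percent_less / 100.0)


# Figures quoted in the abstract.
from_mobilevit = implied_power(79.23, 63.93)   # relative to MobileViTV2
from_swin = implied_power(108.90, 73.26)       # relative to SwinTransformer
print(f"{from_mobilevit:.2f} W, {from_swin:.2f} W")  # prints "28.58 W, 29.12 W"
```

The two estimates differ by roughly half a watt, so they are best read as an approximate range of about 28.6 to 29.1 W rather than an exact measurement.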

Robustness Under Noisy Conditions

The models were tested under two kinds of corruption common in satellite operations, Gaussian noise and motion blur, each applied to the test data at several severity levels:

Gaussian Noise: EfficientViT-M2 showed superior resilience, maintaining higher robustness scores than MobileViTV2 across all severity levels.
Motion Blur: EfficientViT-M2 again led, particularly at higher severity levels.

Figure 3: MobileViT (Top) and EfficientViT (Bottom) robustness with Gaussian noisy inference data.
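To make the two corruptions concrete, here is a minimal NumPy sketch of how Gaussian noise and a horizontal motion blur might be applied to a test image at increasing severity. The severity values, the box-shaped blur kernel, and the function names are assumptions for illustration; the summary does not give the paper's exact corruption parameters.

```python
import numpy as np


def add_gaussian_noise(image, sigma, rng=None):
    """Add zero-mean Gaussian noise to a float image with values in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def motion_blur(image, kernel_size):
    """Horizontal motion blur on an (H, W, C) image using a box kernel."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the output keeps its width")
    kernel = np.ones(kernel_size) / kernel_size
    half = kernel_size // 2
    # Replicate edge pixels so the 'valid' convolution returns W columns.
    padded = np.pad(image, ((0, 0), (half, half), (0, 0)), mode="edge")
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="valid"), 1, padded
    )


# Hypothetical severity levels, chosen only for this example.
NOISE_SIGMAS = (0.04, 0.08, 0.12)
BLUR_KERNELS = (3, 7, 11)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # stand-in for a satellite tile
noisy_versions = [add_gaussian_noise(image, s, rng) for s in NOISE_SIGMAS]
blurred_versions = [motion_blur(image, k) for k in BLUR_KERNELS]
```

Each corrupted copy would then be passed through the trained classifier, and its accuracy compared with the accuracy on the clean image set.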

Discussion and Conclusion

The paper concludes that EfficientViT-M2 is the most suitable model for onboard satellite image classification tasks, given its balanced trade-off between performance, robustness, and computational demands. MobileViTV2, while performing slightly better in accuracy on clean data, is surpassed by EfficientViT-M2 in noisy environments and efficiency metrics.

The implications of this research extend to real-world EO missions, where resource constraints require models that are both efficient and robust. The findings underscore the potential of Transformer-based models for remote sensing, not only in classification accuracy but also in practicality under onboard power and compute limits.

As future work, integrating these ViT models into multitask frameworks could extend the utility of EO systems by learning several tasks at once.