When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery (2404.11797v1)
Abstract: Foundation models, i.e., very large deep learning models, have demonstrated impressive performance in various language and vision tasks that is otherwise difficult to achieve with smaller models. The success of GPT-type LLMs in particular has raised expectations for the potential of foundation models in other domains, including satellite remote sensing. In this context, considerable effort has gone into building foundation models and testing their capabilities in broader applications; examples include Prithvi by NASA-IBM, the Segment Anything Model (SAM), and ViT. This raises an important question: are foundation models always a suitable choice for remote sensing tasks, and if not, when? This work aims to improve the understanding of the status and suitability of foundation models for pixel-level classification of moderate-resolution multispectral imagery, through comparisons with traditional ML and regular-size deep learning models. Interestingly, the results reveal that in many scenarios traditional ML models still match or exceed the performance of foundation models, especially for tasks where texture contributes little to classification. Deep learning models did show more promising results for tasks where labels partially depend on texture (e.g., burn scars), although the performance gap between foundation models and regular deep learning models was not obvious. These results conform with our analysis: the suitability of a foundation model depends on how well its self-supervised pretraining task aligns with the actual downstream task, and the typical masked autoencoder paradigm is not necessarily suitable for many remote sensing problems.
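To make the comparison concrete, below is a minimal sketch of the kind of per-pixel "traditional ML" baseline the paper compares against foundation models: an XGBoost classifier that sees only each pixel's spectral bands, with no spatial context. The synthetic scene, band count, and hyperparameters are illustrative assumptions, not the paper's actual experimental setup.

```python
# Minimal sketch of a per-pixel traditional-ML baseline for multispectral
# pixel-level classification. Synthetic data stands in for a real scene
# (e.g., a Sen1Floods11 or HLS tile); all values below are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, jaccard_score
import xgboost as xgb

rng = np.random.default_rng(0)
H, W, BANDS = 128, 128, 6                 # e.g., six reflectance bands
image = rng.random((H, W, BANDS)).astype(np.float32)
# Toy binary mask driven purely by spectra (e.g., water vs. land).
labels = (image[..., 4] > image[..., 3]).astype(int)

# Pixel-level framing: each pixel is one sample, and its spectral bands are
# its only features. No texture or spatial context is used, which is exactly
# why such models can rival foundation models on spectrally driven tasks.
X = image.reshape(-1, BANDS)
y = labels.reshape(-1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f"F1:  {f1_score(y_te, pred):.3f}")
print(f"IoU: {jaccard_score(y_te, pred):.3f}")
```

Because such a baseline ignores texture entirely, one would expect it to stay competitive on spectrally separable targets (e.g., water), and to fall behind on texture-dependent targets such as burn scars, where U-Net- or Prithvi-style models that consume the full image patch can exploit spatial context.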
- K. He, X. Chen, S. Xie et al., “Masked autoencoders are scalable vision learners,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
- J. Pathak, S. Subramanian, P. Harrington et al., “FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators,” arXiv preprint arXiv:2202.11214, 2022.
- R. Lam, A. Sanchez-Gonzalez, M. Willson et al., “Learning skillful medium-range global weather forecasting,” Science, vol. 382, no. 6677, pp. 1416–1421, 2023.
- Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar, “Fourier neural operator for parametric partial differential equations,” arXiv preprint arXiv:2010.08895, 2020.
- X. Sun, P. Wang, W. Lu et al., “RingMo: A remote sensing foundation model with masked image modeling,” IEEE Trans. on Geoscience and Remote Sensing, 2022.
- J. Jakubik, M. Muszynski, M. Vössing, N. Kühl, and T. Brunschwiler, “Toward foundation models for earth monitoring: Generalizable deep learning models for natural hazard segmentation,” in IGARSS 2023 - 2023 IEEE Intl. Geoscience and Remote Sensing Symposium. IEEE, 2023, pp. 5638–5641.
- P. Dias, A. Potnis, S. Guggilam, L. Yang, A. Tsaris, H. Medeiros, and D. Lunga, “An agenda for multimodal foundation models for earth observation,” in IGARSS 2023 - 2023 IEEE Intl. Geoscience and Remote Sensing Symposium. IEEE, 2023, pp. 1237–1240.
- J. Jakubik, S. Roy, C. Phillips, P. Fraccaro, D. Godwin, B. Zadrozny, D. Szwarcman, C. Gomes, G. Nyirjesy, B. Edwards et al., “Foundation models for generalist geospatial artificial intelligence,” arXiv preprint arXiv:2310.18660, 2023.
- J. Jakubik, L. Chu, P. Fraccaro et al., “Prithvi-100M,” Aug. 2023.
- GLAD, “Global forest change,” https://glad.umd.edu/dataset, 2024, accessed: 04/10/2024.
- P. G. Curtis, C. M. Slay, N. L. Harris et al., “Classifying drivers of global forest loss,” Science, vol. 361, no. 6407, pp. 1108–1111, 2018.
- Y. Xie, Z. Wang, G. Mai, Y. Li, X. Jia, S. Gao, and S. Wang, “Geo-foundation models: Reality, gaps and opportunities,” in Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems, 2023.
- F. Wei, Y. Gao, Z. Wu, H. Hu, and S. Lin, “Aligning pretraining for detection via object-level contrastive learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 22682–22694, 2021.
- L. Grinsztajn, E. Oyallon, and G. Varoquaux, “Why do tree-based models still outperform deep learning on typical tabular data?” Advances in Neural Information Processing Systems, vol. 35, pp. 507–520, 2022.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
- D. Bonafilia, B. Tellman, T. Anderson, and E. Issenberg, “Sen1floods11: A georeferenced dataset to train and test deep learning flood algorithms for sentinel-1,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition Workshops, 2020, pp. 210–211.
- T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proc. of the 22nd ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
- O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th Intl. Conf., Munich, Germany, Proceedings, Part III. Springer, 2015, pp. 234–241.
- L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proc. of the European Conf. on Computer Vision (ECCV), 2018, pp. 801–818.
- M. Thomas, E. Tellman, D. E. Osgood et al., “A framework to assess remote sensing algorithms for satellite-based flood index insurance,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 16, pp. 2589–2604, 2023.